Skip to main content

DataFrame Diff

Sometimes user has access to two copies of the data one of which may have been modified with new information and user is interested in seeing what has changed. SQL Frames provides a robust data comparision application where any two dataframes can be compared.

Eta

This functionality requires custom table cell renderers which make use of the library Eta. Although Eta is optional, this functionality will make it a required dependency.

The diff API

The DataFrame.diff provides the necessary API to visualize the difference between two DataFrames. The API takes two DataFrames and a list of one or more fields that uniquely identify the rows in the DataFrames. The two DataFrames are referred as dataset A and dataset B.

Unique Id(s)

This approach of doing comparision based on the unique identifiers is advanced, productive and performant than a simple Excel data comparision based on row and column position.

Loading...

Diff Actions

It is possible to augment the Diff dataframe with a set of actions in the menubar and contextual actions in the field headers using ui.diffActions().

The Diff menu allows viewing the following subsets of the data

  1. All - rows present in either A or B
  2. Both - rows present in both A and B
  3. Changed - rows present in both A and B and also there is some difference
  4. Not in both - rows present in only A or only B
  5. Only in A - rows present in A but not in B
  6. Only in B - rows present in B but not in A
Loading...

Diff Sort

The ui.diffActions() API also adds column header level context menu to provide different sort options. The possible options are

  1. Sort by value

    1. Using value from A and then B if A value is null
    2. Using value from B and then A if B value is null
  2. Sort by absolute difference (for number fields)

  3. Sort by absolute percent change (for number fields)

Hierarchical Diff

The best thing with SQL Frames is that the integrated UI allows viewing data in different ways and features such as this data comparision can work on other types visual representation.

UI Actions

All the options to look at different subsets of the compared data and sort options are still available in the hierarchical view.

Loading...

Slice & Compare

It is possible to slice and compare specific data. Current slicer API works on a single DataFrame. This poses a challenge. Also, both datasets being compared may have different list of values for the same field posing another change. The trick to overcome both these challenges is to use a udf, slice the udf and then filter and extract out both the datasets after slicing and then using them to do the diff. An example is shown below with much smaller dataset to understand this advanced usage of SQL Frames low-code API.

Loading...
note

The withRowSetId() API is needed only to generate correct SQL. This may not be necessary in the future. This call can be omitted if SQL generation is not required without impacting the actual data computation.

Filtering Diff DataFrame

Diff dataframe is a special dataframe that can only support filtering using the set operations provided by Diff actions. Ad-hoc filtering similar to a FilteredDataFrame is not possible. Data context menu actions of filtering with equals and not equals are supported but no other filters are supported.

Filter Same Difference

The context menu actions infact allow filtering for rows with the same difference.

Assumptions

This functionality assumes the data to be present in both DataFrames as follows

  1. The unique id(s) are used to match the rows from both the DataFrames. Change to the unique id(s) results in two separate rows one in each DataFrame without a coresponding matching row from the other.
  2. Comparision is performed only among fields with the same name.
  3. Order of the fields doesn't matter.
  4. Each DataFrame can have fields that are only present in it and they are not used for comparision.
  5. In JavaScript NaN is not equal to NaN. Diff considers NaN is equal to NaN to reduce noise.
  6. Percent with base value of 0 is undefined. It is considered as null and the standard SQL null logic is applied fort sorting.
  7. Difference and Percent change are availale only when the entire rows of the respective DataFrames have the number values (or nulls). Mixed data doesn't recognize the fields as number fields and hence these sort metrics are not available.
Same data but different column names

If the same physical column has different names in both the DataFrames, it is still possible to compare them by first projecting with the same name.