Data Snapshots
To provide client-side in-memory analytics, data needs to be extracted from the source system and sent to the browser. This data extract can be live or a snapshot. Live extracts typically query directly from the source database and depending on the complexity of the query and the capacity of the database it may or may not be practical. Also, certain types of data may not be frequently changing or in some cases it is immutable. Hence an alternate strategy to live extract is to schedule and create data extracts as snapshots. Further these data snapshots can be stored as files either on the server or on a cloud file system such as Amazon S3.
Multiple Snapshots
Suppose one were to analyze data for the last 10 days and there is a separate snapshot file for each day, then a DataFrame needs to be loaded from multiple files. It is possible that the current day's data is not a snapshot but live data.
SQL Frames provides APIs to load DataFrames. See TABLE.
The DataFrame.fromURL()
function accepts multiple URLs passed as an array.
Below is an example of loading a DataFrame from multiple remote snapshots.
Data Dedupe
Most transactions get updated over a period of time and hence each snapshot may contain different version of the data. In such cases, it is important to deduplicate the data and load it before further data transformations.
The rowid
option indicates the field which can be used to uniquely identify a row in the input data.
The retain
option indicates the strategy of which row to retain in the case of multiple
records with the same rowid
.
Dedupe strategy
There are currently three supported dedupe strategies
first
- This strategy retains the first record among the multiple datasets during the initial load. This is not applicable during the incremental load.last
- This strategy retains the last record among the multiple datasets during the initial load. This is the default strategy.max
- This strategy retains the record which has the max value. This strategy is used both during initial load and incremental load. The fieldmaxFields
should also be specified for this strategy representing one or more fields. The max value is computed as the max value among all the values and that value is compared against the previous max value of the same record.
When max
strategy is used with CSV files, string values are used for comparision during the initial load
and type converted values are using during incremental load. This is because, during the initial load, the
data types of each field are only identified after processing all the data and hence is not available
as data is being parsed and deduplicated.
The string values are compared in a special manner so that both integers and fixed format date and datetimes in canonical format (YYYY-MM-DD and YYYY-MM-DD HH24:MI:SS) can be supported. The logic is to compare the length of the strings and only when they are same, to compare the strings themselves. This ensures that it works for integer values as well.
Dedupe of the records happens both within and across the remote url data sources.