GROUP BY
The GROUP BY clause allows aggregating data to compute metrics such as COUNT
, SUM
and AVG
.
SQL Frames provides several APIs to handle data aggregation. All the aggregate functions are
accessible from SQL.agg
object. In addition, groupBy
is accessible from SQL
. SQL Frames
provides advanced group clauses namely ROLLUP
, CUBE
and GROUPING SETS
.
Using the df.gdf()
API it is possible to create a Grouped DataFrame.
GROUP BY
The GROUP BY clause by default results in creating a single level of detail (LOD). In the example above the LOD is (Region, Country).
The GROUP BY has extensions which provide advanced capabilities. These extended clauses described below allow the creation of multiple levels of details within a single query which many times results in faster execution compared to computing them separately because higher LOD are calculated from lower LOD instead of directly from the source data.
The performance of the GROUP BY
clause depends on the cardinality of the grouping columns.
When grouping based on a single column that has high cardinality or multiple columns whose collective cardinality is high,
it takes more memory and time to perform the grouping operation.
A common mistake when building a drag and drop UI is to allow grouping by a unique id column such as employee id or name. The correct approach to analyzing at the individual record level is to make use of Analytic Functions.
ROLLUP
A ROLLUP clause can be used to aggregate data to higher levels. For example, if the above metrics need to be computed not just by country but also region which is a level higher up, then the ROLLUP clause can be used.
A ROLLUP with N fields results in N+1 LOD partitions.
Standard SQL uses null
value for values of dimensions that have been rolled up. Hence it is hard to distinguish
between data having a null
value vs the aggregate row being a rolled up row for the given dimension. SQL Frames
automatically tracks this additional metadata and uses it within the visualizations.
CUBE
When analyzing data by geography and product category there is no natural hierarchy. These two dimensions are independent and it may be desirable to have aggregates at not only at the base level of country and product category but by country alone and also category alone. In such cases the CUBE clause can be used.
A CUBE with N fields results in 2N LOD partitions.
GROUPING SETS
A CUBE results in several levels of groupings and not all of those may be required. In such cases it is possible to explicitly specify the desired levels of details (LOD).
Using [] indicates grand total level where all the provided dimensions are rolled up.
Hierarchical View
The summarized data can be further visualized as a Hierarchy simply using the df.hdf()
API.
Data Cube
Using the advanced GROUP BY clauses, it is possible to build multi-dimensional Data Cubes that provide the ability to drill-up/down the aggregated data. See Data Cube for more details.