The GROUP BY clause allows aggregating data to compute metrics such as
SQL Frames provides several APIs to handle data aggregation. All the aggregate functions are
SQL.agg object. In addition,
groupBy is accessible from
SQL. SQL Frames
provides advanced group clauses namely
df.gdf() API it is possible to create a Grouped DataFrame.
The GROUP BY clause by default results in creating a single level of detail (LOD). In the example above the LOD is (Region, Country).
The GROUP BY has extensions which provide advanced capabilities. These extended clauses described below allow the creation of multiple levels of details within a single query which many times results in faster execution compared to computing them separately because higher LOD are calculated from lower LOD instead of directly from the source data.
The performance of the
GROUP BY clause depends on the cardinality of the grouping columns.
When grouping based on a single column that has high cardinality or multiple columns whose collective cardinality is high,
it takes more memory and time to perform the grouping operation.
A common mistake when building a drag and drop UI is to allow grouping by a unique id column such as employee id or name. The correct approach to analyzing at the individual record level is to make use of Analytic Functions.
A ROLLUP clause can be used to aggregate data to higher levels. For example, if the above metrics need to be computed not just by country but also region which is a level higher up, then the ROLLUP clause can be used.
A ROLLUP with N fields results in N+1 LOD partitions.
Standard SQL uses
null value for values of dimensions that have been rolled up. Hence it is hard to distinguish
between data having a
null value vs the aggregate row being a rolled up row for the given dimension. SQL Frames
automatically tracks this additional metadata and uses it within the visualizations.
When analyzing data by geography and product category there is no natural hierarchy. These two dimensions are independent and it may be desirable to have aggregates at not only at the base level of country and product category but by country alone and also category alone. In such cases the CUBE clause can be used.
A CUBE with N fields results in 2N LOD partitions.
A CUBE results in several levels of groupings and not all of those may be required. In such cases it is possible to explicitly specify the desired levels of details (LOD).
Using  indicates grand total level where all the provided dimensions are rolled up.
The summarized data can be further visualized as a Hierarchy simply using the
Using the advanced GROUP BY clauses, it is possible to build multi-dimensional Data Cubes that provide the ability to drill-up/down the aggregated data. See Data Cube for more details.