Skip to main content

GROUP BY

The GROUP BY clause allows aggregating data to compute metrics such as COUNT, SUM and AVG. SQL Frames provides several APIs to handle data aggregation. All the aggregate functions are accessible from SQL.agg object. In addition, groupBy is accessible from SQL. SQL Frames provides advanced group clauses namely ROLLUP, CUBE and GROUPING SETS.

Using the df.gdf() API it is possible to create a Grouped DataFrame.

GROUP BY

Loading...

The GROUP BY clause by default results in creating a single level of detail (LOD). In the example above the LOD is (Region, Country).

The GROUP BY has extensions which provide advanced capabilities. These extended clauses described below allow the creation of multiple levels of details within a single query which many times results in faster execution compared to computing them separately because higher LOD are calculated from lower LOD instead of directly from the source data.

High cardinality dimensions

The performance of the GROUP BY clause depends on the cardinality of the grouping columns. When grouping based on a single column that has high cardinality or multiple columns whose collective cardinality is high, it takes more memory and time to perform the grouping operation.

A common mistake when building a drag and drop UI is to allow grouping by a unique id column such as employee id or name. The correct approach to analyzing at the individual record level is to make use of Analytic Functions.

ROLLUP

A ROLLUP clause can be used to aggregate data to higher levels. For example, if the above metrics need to be computed not just by country but also region which is a level higher up, then the ROLLUP clause can be used.

A ROLLUP with N fields results in N+1 LOD partitions.

Loading...
Nulls vs Summaries

Standard SQL uses null value for values of dimensions that have been rolled up. Hence it is hard to distinguish between data having a null value vs the aggregate row being a rolled up row for the given dimension. SQL Frames automatically tracks this additional metadata and uses it within the visualizations.

CUBE

When analyzing data by geography and product category there is no natural hierarchy. These two dimensions are independent and it may be desirable to have aggregates at not only at the base level of country and product category but by country alone and also category alone. In such cases the CUBE clause can be used.

A CUBE with N fields results in 2N LOD partitions.

Loading...

GROUPING SETS

A CUBE results in several levels of groupings and not all of those may be required. In such cases it is possible to explicitly specify the desired levels of details (LOD).

Loading...
Grand Total LOD

Using [] indicates grand total level where all the provided dimensions are rolled up.

Hierarchical View

The summarized data can be further visualized as a Hierarchy simply using the df.hdf() API.

Loading...

Data Cube

Using the advanced GROUP BY clauses, it is possible to build multi-dimensional Data Cubes that provide the ability to drill-up/down the aggregated data. See Data Cube for more details.