Skip to main content

Double Aggregation

· 5 min read

It is common to pre-aggregate data to improve the performance of generating the reports. While pre-aggregation is in general a good technique, grouping of data by various dimensions and their permutations and time dimension and its various buckets including sliding windows can lead to explosion of pre-aggregated data. Hence the requirements should be carefully evaluated and in some cases it may be better solved using a technique called double aggregation, where data is pre-aggregated to the lowest level first and is then aggregated further as needed on-demand.

The theory

One of the key observations of double aggregating data is that some aggregation operations can be done in stages by computing aggregation by rolling up some dimensions and then aggregate further to get the final result. Mathematically speaking f(x) = g(h(x)) where f, g and h can be same aggregate functions or different ones. This is illustrated with the following example.

Take the list of numbers [1,3,8,2,8,5,9,6,7]. Split this into three sub-lists as [1,3,8], [2,8,5] and [9,6,7]. Now, each of the following metrics can be computed using the double aggregation technique.

AggregateSet 1Set 2Set 3Entire SetOperation
COUNT3339SUM(COUNT(*))
SUM12152249SUM(SUM(x))
AVG457 1/35 4/9SUM(SUM(x))/SUM(COUNT(*))
MIN1261MIN(MIN(x))
MAX8899MAX(MAX(x))

Those familiar with distributed computing systems such as Hadoop should be already familiar with this pattern. The idea there is to scale large amounts of data by dividing it into several smaller data sets, then compute the aggregates on these smaller data sets and then further aggregate the intermediate aggregate results of the smaller data sets to higher levels. The ability to distribute the large data sets into several smaller data sets among hundreds or even thousands of compute nodes is what gives systems like Hadoop, Big Query and Snowflake the ability to scale to large data sets.

SQL Frames can be used to do the higher levels of aggregation from the lower levels within the browser. This allows doing the heavy lifting of aggregating large data sets on the server side either on-demand or by pre-aggregating and then do the rest of the higher levels of aggregation within the browser on the resulting data set that is much smaller. A practical example of this is to show a pivot table with sub-totals at various levels.

Pivot table example

Let's use the Sales Orders data set from the data sets for this example.

Below is the final pivot table with multiple dimensions and rollups including time buckets.

Loading...

Below we show the double-aggregation technique.

First level of aggregate

Here we compute the first level of aggregate within the browser itself, but ideally the corresponding data is loaded from a server. Data is pre-aggregated to month level.

Loading...

Notice how the first level aggregate only contains the lowest level of aggregation namely by region, country and month. Since the data is sparse, it resulted in 100 rows same as the original data set. However, with large data sets, the first level aggregate would still be much smaller, say a few thousand rows.

Second level aggregate

Loading...

Additive Facts

Note that this double aggregation technique is not possible for all types of aggregations. As mentioned in kimball article, facts can be additive, semi-additive and non-additive. The double aggregation technique is meant for additive facts. For example, COUNT(*) is additive while COUNT(DISTINCT x) is not additive.

OLTP+OLAP

MySQL which is traditionally an OLTP database engine is augmented with OLAP capabilities by Oracle with their technology called HeatWave. While amazing technologies like this allow mixing OLTP and OLAP workloads, being able to augment the analytics stack with as much cache and compute layers as possible can improve the overall performance of the system. SQL Frames is a great solution to provide in-memory analytics on top of your existing solutions and improve user experience and reduce TCO.

Conclusion

As shown in this example, using the lowest level of aggregates provided by the server, data could be further aggregated creating multiple levels of detail using SQL Frames within the browser to show a pivot table with sub totals and grand totals at multiple levels. It is also possible to create higher level window function based metrics from the lower level aggregates to provide cumulative totals, moving averages and other such metrics, all without doing any additional work on the server.