One of the most important visualization techniques is Pivot Tables. They provide side-by-side comparison of data along both rows and columns. And often the columns end up being the time dimension (year, quarter or month). The power of pivot tables is amplified if there are sub-totals and grand-totals along with just the totals. But the question is whether all this can be done with one single SQL query or at least as fewer queries as possible? If a database supports the CUBE group by clause, such as PostgreSQL and Oracle, then this is simple and can be done in a single query. But what about MySQL the other popular database that only supports ROLLUP group by clause?
The simple answer is that it is possible to mimic the CUBE operation on any database that supports UNION ALL with a single SQL statement. However, this requires a deeper understanding of how the CUBE works and more specifically the
GROUPING in MySQL) works. Very complex GROUP BY clause can be defined using databases like Oracle (SQL Frames supports all that complexity). And the entire complexity can theoretically be translated into a single UNION ALL clause. However, we will go through a simple example of a GROUP BY meant to provide full pivot table (sub-totals along rows, columns and grand total) using just two dimensions one along the rows and the other along the columns.
Pivot table with CUBE
Any set of N elements can be used to create 2^N subsets all of them together is the power set. The CUBE group by clause does exactly that, it acts as a power set operator to create aggregations at all the possible subsets of the given dimensions.
As an example, CUBE(
b) creates aggregates at (
b) level, (
a) level, (
b) level and () level. These are also referred to as levels of detail (LOD) in some products such as Tableau. When we create a pivot table with dimension
a along the rows and dimension
b along the columns, then each of the cells is the (
b) level of detail, the row sums are (
b) LOD, column sums are (
a) LOD and the corner sum at the bottom is the grand total belonging to () LOD. A more generalized pivot table would still contain all the 2^N level totals.
Note that the pre-requisite to creating a pivot table is to aggregate using group by. If the client side rendering engine has awareness of the LODs, then the actual organization of the grouped data into pivot table can be completely done on the client side saving precious database resources.
Below is an example where Region is used as the
a dimension and Year of Order Date as the
Pivot table with ROLLUP
MySQL currently doesn't support the CUBE group by clause. However, it has support for ROLLUP clause since version 5.0 (?). Unlike CUBE which creates the entire power set of levels of details, a ROLLUP only creates (N+1) levels of detail. That is, ROLLUP(a,b) generates (a,b), (a) and () levels of details. This essentially forces one to sacrifice sub-totals either along the rows or columns by using ROLLUP(a,b) or ROLLUP(b,a)
Row sub totals
Only the row sub-totals, that is the total of all the cells within each row, are shown below.
Column sub totals
Only the column sub-totals, that is the total of all the cells within each column, are shown below. This is achieved by just reversing the order of the ROLLUP columns.
Pivot table query with only ROLLUP
Below is how to create a MySQL compatible pivot table query.
The above query is equivalent to the query based on CUBE. SQL Frames Pivot DataFrame currently can only take the entire cube as input to do the rendering of the pivot table entirely on the client-side.
The above way of mimicking CUBE can be generalized to more complex GROUP BY expressions. The key is to understand how the GROUPING_IDs are constructed for each LOD. It is nothing but a simple bit mask of all the terms in the GROUP BY clause and rollup of a dimension is indicated by turning on the corresponding bit. That is why the lowest level is 0 as all the dimensions are present and hence their bits are turned-off.
Performance of Pivot Table Query
Note that using the UNION ALL as shown above works fine but it may not be as optimal as using the CUBE operator. However, it is still far better than some reporting solutions that fire several queries to construct a Pivot Table with sub-totals (see this discussion on Metabase for example).
SQL Frames provides browser based in-memory analytics solution. However, it also has the capability to transpile the DataFrame logic into SQL that can be executed on the server-side. Multiple databases are supported. In addition, SQL Frames even makes it easy to focus on business logic and not worry about the rigid syntax structure of SQL by masking the complexity to end users. All of this makes the auto SQL generation component of SQL Frames a complex piece of technology that includes expansion, minimization and optimization. Other technologies such as query pruning and query rewrites are planned.
At present operators used in DataFrame logic are assumed to be available in the target database (and hence gerated SQL could be incorrect). An even more ambitious plan is to be able to provide suitable SQL transformations to mimick constructs that are missing in a target database as shown above for the CUBE operator.