Recursive Hierarchy Flattening
Recursive hierarchies are very common in the enterprise applications. Employee-Manager relationship is the most common recursive hierarchy everyone is familiar with. In addition, Bills of Material for Supply Chain Manufacturing and Territories for Sales are a few other common recursive hierarchies.
Recursive hierarchies pose a challenge for analytics because there are no
simple SQL constructs to create hierarchical aggregate rollups. The Oracle
CONNECT BY/STARTS WITH
clause and the more generic WITH RECURSIVE
CTE
(common table expression) can process data in a top-down or bottom-up manner.
However, to process an aggregate query, the processing of the data needs to
identify the top node which needs to be analyzed and its child nodes need
to be identified one level at a time from top-to-bottom. Aggregation on the
other hand has to happen bottom-up. That is why, most solutions to analyze
data using CTE end up at least two recursive CTEs creating a lot of confusion.
Hierarchy Flattening
A common technique to analyze recursive data is hierarchy flattening. That is, the hierarchy is assumed to be at most of a certain depth and those many columns are explicitly created and the entire hierarchy path is denormalized into these columns. Once this is done, then many hierarchy questions can be answered using the rest of the familiar SQL constructs.
There are no direct SQL constructs to help with the hierarchy flattening. However, SQL Frames provides an API to flatten and denormalize the hierarchy.
Let's take a look at the example below using the df.wrangle.flatten()
API.
Hierarchy Flattening is a special operation that adds dynamic number of columns to the DataFrame. Hence, there are some limitations
- Should not be done on DataFrames that only project columns.
- The parent nodes should be uniquely identifiable.
Hierarchy Aggregation
Now that the table is flattened, let's see how this can then be used for aggregation.
Flattened to Hierarchy
Recursive data can be visualized hierarchically and SQL Frames is smart enough to construct
such recursive hierarchies from data transformed using WITH RECURSIVE
CTE. But flattened recursive
hierarchies can also be visualized hierarchically using the regular fields grouping based hierarchies
created using the hdf()
API.
Even though non-aggregate hiearchical DataFrames store all the data rows at the leaf level, flattened hierarchies do store them in non-leaf node level as appropriate. If the flattened hierarchy has N flattened columns but the hierarchy is based on M fields (M < N) then only the first M-1 recursive rows are stored in the non-leaf nodes.