Recursive Hierarchy Flattening
Everyone who studies data structures and algorithms would come across
tree data structure and related algorithms at some point or the other.
The tree traversal algorithms provide the ability to do bottom-up
aggregation, top-down propagation, breadth first, depth first processing
and so on. While SQL provides a standard to process recursive data using
the WITH RECURSIVE
CTE clause, doing any complex analysis on recursive
data using SQL
is like using assembly language to build a web page.
As much as SQL Frames is a fan of SQL
, this is one of the use cases
where it has to breakaway from the limitations of it.
Recursive relational data
Data modeling always involves identifying relationships among data. Most of the time the relationships are between two different tables but occasionally they happen to be in the same table. Such relations are self-referencing. A common example is the employee-manager relationship where a manager is also an employee and hence the employee table would have two records one for the manager and one for the employee and the employee record also has a foreign key pointing to the employee record of the manager.
Recursive data analytics
Let's consider the requirement of computing the payroll expenses of a manager for that person and anyone reporting directly or indirectly to that person. This is a tree-traversal bottom-up aggregation problem starting with that person as the root of the tree.
In a tree-traversal, there is an implicit stack that keeps track of the nodes that have been visited so far and hence when performing the bottom-up aggregation each intermediate node is accessed twice, once to traverse down and once to aggregate bottom-up. Note that it is possible to maintain an accumulator while traversing down and not do bottom up aggregating. However, if the totals have to be computed for each of the intermediate nodes, then the bottom-up aggregation is the only way.
When processing data using WITH RECURSIVE
CTE expression, each row is visited
only once. As a result, doing bottom-up aggregation requires using two WITH RECURSIVE
CTEs
as described in this
stackoverflow answer (which btw seem to work for fixed depth trees). One to identify
all the nodes from within a given root and another to do the rollup. This makes
the queries both complex and slow.
Flattening recursive data
One technique to deal with recursive hierarchies is to flatten the tree-path (root to the node) for each record. For example, a management hierarchy with at most 7 levels deep will have to maintain 7 columns (mgr1, mgr2, ...) and materialize the path information in these columns for each row.
It may be likely possible to compute this in one go and materialize the entire tree by inserting into a new table with corresponding paths materialized but only if the number of levels is assumed to be fixed. But if the levels can be dynamic and no way to know upfront, then the only way I can think of is to do it procedurally.
Flattening in SQL Frames
Today I am happy to announce that SQL Frames has added a new df.wrangle.flatten
API to
create the flattened columns when computing a DataFrame. The hierarchy can have variable depths.
See the documentation for more details.
SQL Frames provides integrated UI to the various types of DataFrames. The flattened recursive
data can then be put together into hierarchical format using the grouping levels based on the
flattened columns acting as the groups. The df.hdf().fh()
API can be used to creating the
grouping columns dynamically based on the max depth of the tree in the source data.
Summary
SQL Frames continues to solve real world hard enterprise problems so your analytics can be elevated to the next level. Reach out to me to learn how SQL Frames can help with your interactive visualization needs.