Skip to main content

Recursive Hierarchy Flattening

· 4 min read

Everyone who studies data structures and algorithms would come across tree data structure and related algorithms at some point or the other. The tree traversal algorithms provide the ability to do bottom-up aggregation, top-down propagation, breadth first, depth first processing and so on. While SQL provides a standard to process recursive data using the WITH RECURSIVE CTE clause, doing any complex analysis on recursive data using SQL is like using assembly language to build a web page.

As much as SQL Frames is a fan of SQL, this is one of the use cases where it has to breakaway from the limitations of it.

Recursive relational data

Data modeling always involves identifying relationships among data. Most of the time the relationships are between two different tables but occasionally they happen to be in the same table. Such relations are self-referencing. A common example is the employee-manager relationship where a manager is also an employee and hence the employee table would have two records one for the manager and one for the employee and the employee record also has a foreign key pointing to the employee record of the manager.

Recursive data analytics

Let's consider the requirement of computing the payroll expenses of a manager for that person and anyone reporting directly or indirectly to that person. This is a tree-traversal bottom-up aggregation problem starting with that person as the root of the tree.

In a tree-traversal, there is an implicit stack that keeps track of the nodes that have been visited so far and hence when performing the bottom-up aggregation each intermediate node is accessed twice, once to traverse down and once to aggregate bottom-up. Note that it is possible to maintain an accumulator while traversing down and not do bottom up aggregating. However, if the totals have to be computed for each of the intermediate nodes, then the bottom-up aggregation is the only way.

When processing data using WITH RECURSIVE CTE expression, each row is visited only once. As a result, doing bottom-up aggregation requires using two WITH RECURSIVE CTEs as described in this stackoverflow answer (which btw seem to work for fixed depth trees). One to identify all the nodes from within a given root and another to do the rollup. This makes the queries both complex and slow.

Flattening recursive data

One technique to deal with recursive hierarchies is to flatten the tree-path (root to the node) for each record. For example, a management hierarchy with at most 7 levels deep will have to maintain 7 columns (mgr1, mgr2, ...) and materialize the path information in these columns for each row.

It may be likely possible to compute this in one go and materialize the entire tree by inserting into a new table with corresponding paths materialized but only if the number of levels is assumed to be fixed. But if the levels can be dynamic and no way to know upfront, then the only way I can think of is to do it procedurally.

Flattening in SQL Frames

Today I am happy to announce that SQL Frames has added a new df.wrangle.flatten API to create the flattened columns when computing a DataFrame. The hierarchy can have variable depths. See the documentation for more details.

SQL Frames provides integrated UI to the various types of DataFrames. The flattened recursive data can then be put together into hierarchical format using the grouping levels based on the flattened columns acting as the groups. The df.hdf().fh() API can be used to creating the grouping columns dynamically based on the max depth of the tree in the source data.

Summary

SQL Frames continues to solve real world hard enterprise problems so your analytics can be elevated to the next level. Reach out to me to learn how SQL Frames can help with your interactive visualization needs.