Finding items that are frequently shopped together is an interesting piece of information. For brick-and-mortar companies this can help to decide which items to offer promotions to attract more customers without comprimising the overall profitability. For online stores, it can help with being able to appropriately cross-sell items.
Frequent itemsets is a well studied data mining technique. Finding frequent pairs is a more specific but important variation since higher cardinality sets can be easily identified using lower cardinality sets. Depending on the size of data, finding the frequent pairs may have to be run on a cluster of computers to just as a single database query! When writing it as a SQL query, care must be taken to avoid performance issues. Below we investigate three different SQL queries all providing the same results.
About the data
We will use the handy Northwind dataset for this analysis. In addtion we will assume that the transaction data only has a reference to the product id and hence it has to be joined to the product table to get the name of the product.
As seen in the
OrderDetails table, each order can have multiple items and hence
one row for each (order,item) combination.
Generating the pairs
The first question that comes to mind is how to generate pairs when each record contains only one item. This can be done by thinking in terms of Sets. Imagine you have a set of items, how do you identify pairs of items from this set? You can imagine that you have two identical sets and you are doing a cartesian product among them. Yes, that is exactly how we will be generating the pairs. This identical sets cartesian product in SQL is obtained using self-joins.
Without worrying about frequency or product names, lets first learn how to generate the pairs of items.
Notice that the
OrderDetails table has 2,155 rows and the generated pairs table has
only 2,452 rows. In general, if there are
N total rows but
O orders with
L items each
(on average) where
L, then the total pairs can be in the order of
Essentially the data can expand by a factor of
Here we first do the self-join and also resolve the item names and then group by to identify the top pairs.
This is similar to the above query except the id to name resolving for products is done after identifying the frequent pairs. Since the pairs list is much smaller than the order details data, the join to resolve the product id is done only when needed and hence can improve performance.
While the query above is better, the cartesian product can still blow up and result in a large number of rows compared to the order details (which itself can be very large). The key insight to reduce the size of the cartesian product is to restrict the product only to items that can potentially end up in the top. That is, if an item is not frequent enough by itself, it can't be frequent enough along with another item.
This is referred to as The A-Priori Algorithm in Frequent Itemsets chapter of MMDS
The timings provided about may not indicate that this 3rd query is better because of the small size of the demo data. For real life queries with lot of data this can make a lot of difference.
Custom frequent item
In the aboe query, the
fi provides all the items that have been bought certain number of
items. But it doesn't have to be this. Say the marketing team wants to run promotions
only for a subset of items, then it can be stored in a table and used in place of the
This provides frequent pair analysis within the business context!
Notice how the DataFrame explorer provided by SQL Frames allows visually comparing the execution plans of queries written in different ways but giving the same results. A good data analyst knows how to write SQLs but a great data analyst knows how to tweak the SQLs for performance and to meet the business requirements with ease.