Vertical and horizontal databases in itemset mining

Itemset mining is a data mining task for discovering patterns that appear frequently in transaction databases. In this context, a pattern, also called a frequent itemset, is a set of values that frequently occur together in transactions (records) of a database. Frequent itemset mining has many applications in various fields, but the traditional application is for the analysis of customer transactions. Given a database of customer transactions (sets of purchased items), applying an itemset mining algorithm can reveal the frequent itemsets, that is the set of items that are frequently bought together by customers.

In itemset mining research, two terms are often used, which are the concept of horizontal database and vertical database. They refer to how the data is represented. In this blog post, my goal is to explain what these terms mean.

There are two main ways of representing data for itemset mining. One way is to use a horizontal data format, where each row is a transaction, described by a transaction ID and a set of items in that transaction. For example, this is a horizontal database:

TIDItems
1A,B,C
2B,C,D
3A,C,D
4A,B,D

Here, the first row has the Transaction ID (TID) 1 and indicates that a customer has purchased some items A, B, and C, which could for example represent Apple, Bread and Cake. The other rows follow the same format.

The other main way of representing data for itemset mining is to use a vertical data format, where each column corresponds to an item and gives the list of transaction IDs that contain that item. For example, this is a vertical database:

ItemTID_set
A1,3,4
B1,2,4
C1,2,3
D2,3,4

For example, the first two indicates that the item A appears in the first, third and fourth transactions (that is the transactions with IDs 1, 3 and 4).

These two database formats are two different ways of representing the same data. In other words, it is possible to convert from one format to the other, that is to transform the first table into the second table, or transform the second table to obtain the first table.

The choice of the data format can affect the performance and scalability of itemset mining algorithms. In general, the horizontal data format is more suitable for breadth-first search algorithms, such as the Apriori algorithm, which generates candidate itemsets level by level and scans the database multiple times to count their support. On the other hand, the vertical data format is more suitable for depth-first algorithms, such as the ECLAT algorithm, which generates candidate itemsets by intersecting the sets of transactions containing different itemsets.

Let me explain a little bit more about how the database format influences the design of itemset mining algorithms with an example. Lets say that we want to count the support (the number of occurrences of the itemset {A, B} (the purchase of items A and B together).

To count the support of {A, B} in the first table (using the horizontal format), we need to scan all the transactions from the database and count each one that contains A and B together. After reading the four lines of the first table, we find that {A, B} appears two times (has a support of 2).

Now if we want to count the support of {A, B} in the second table (using the vertical format, instead), it is more simple. We only need to do the intersection of the row of A and the row of B. Let me explain this in detail. In the row of A, we have the list of transactions 1, 3, and 4. In the row of B, we have the list of transactions 1, 2, and 4. By doing the intersection of these two lists, we find that {A,B} appears in transactions 1 and 2, and thus that the support of {A, B} is 2. Thus, if we use the vertical database format, it can be more efficient for counting the support of {A, B} because we do not need to read the whole database but just to look at the rows of A and B. This is one reason why vertical itemset mining algorithms can perform quite well in some situations (but not always).

Besides the horizontal and vertical data format, there of course exists other ways of representing the data from a database. Another way is to use prefix-trees, which is for example the internal data representation adopted by the FP-Growth algorithm.

That is all for day. I just wanted to explain the difference between horizontal and vertical database formats. Hope that this was informative and helpful.


Philippe Fournier-Viger is a professor of computer science and founder of the SPMF data mining library.

This entry was posted in Data Mining, Data science, Pattern Mining and tagged , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *