What is a Closed Itemset and Why is it Useful?

In this blog post, I will explain in simple terms what is a closed itemset and give some examples. I will also mention a few algorithms that can be used to find closed itemsets and that they can be found in the SPMF data mining library in Java.

Frequent itemset mining is a popular technique for discovering interesting relationships between items in data represented as a table. The classical application of frequent itemset mining is to analyze data from a transactional database to find correlations between the items that customers purchase.

An itemset is a set of items that appear together in a transaction or a record. For example, if we have a database of customer transactions, an itemset could be {bread, butter,jam}, which means that some customers have bought bread, butter and jam together.

A frequent itemset is an itemset that appears frequently in a database, that is, its occurrence is above a minimum threshold (minsup) set by the user. For example, if minsup is set to 10%, then a frequent itemset is an itemset that appears in at least 10% of the transactions made by customers. Frequent itemsets have many applications. In the context of analyzing customer transactions, they can be used for marketing, recommendation systems, or cross-selling.

Though frequent itemset mining has many applications, a problem is that it can generate a very large number of frequent itemsets, especially when the database contains many items and transactions. This can make it difficult for users to analyze and interpret the results, and also consume a lot of memory and computational resources.

To address this problem, one can use the concept of frequent closed itemset. A frequent closed itemset is a frequent itemset that has no frequent superset with the same support (appear in the same number of transactions). For example, if {bread, milk} has a support of 15% and {bread, milk, cheese} has a support of 15% as well, then {bread, milk} is not a frequent closed itemset because it has a frequent superset with the same support.

Let’s take an example. Suppose we have a database with four customer transactions, denoted as T1, T2, T3 and T4:

T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}

where the letters a, b, c, d indicate the purchase of items apple, bread, cake and dattes.

If we set the minimum support threshold to 50% (which means that we want to find itemsets appearing in at least two transactions), a frequent itemset mining algorithm will output the following frequent itemsets:

{a}, {b}, {c}, {d}, {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {a,b,c}, {a,b,d}

But among these frequent itemsets, only four are frequent closed itemsets:

{a,b}, {a,b,c}, {a,b,d}, {a,b,c,d}

The reason is that these itemsets have no frequent supersets with the same support. For example, {a,b} has a support of 100%, and none of its supersets such as ({a,b,c}, {a,b,d}, {a,b,c,d}) have the same support. On the other hand, {a,c} is not closed because it has a superset ({a,b,c}) with the same support (50%).

Why are closed itemsets useful?

Closed itemsets are important because they can reduce the number of frequent itemsets presented to the user, without losing any information. Frequent itemsets can be very large and redundant, especially when the minsup is low or the database is dense. By mining closed itemsets, generally only a very small set of itemsets is obtained, and still all the other frequent itemsets can be directly derived from the closed itemsets. In other words, any frequent itemset can be derived from a closed itemset by removing some items. For example, we can obtain {a} from {a,b} by removing b, or we can obtain {b,d} from {a,b,c,d} by removing a and c. Therefore, by using closed itemsets, we can reduce the number of itemsets presented to the user without missing any important information.

How to find closed itemsets?

There are several algorithms that have been proposed for this task, such as Apriori-Close, CHARM, FP-Close and LCM. These algorithms are based on different strategies, such as pruning, merging, prefix trees or bitmap representations, to efficiently find all closed itemsets without missing any or generating any false positives. Some of these algorithms have also been adapted to mine maximal itemsets, which are closed itemsets that have no frequent supersets at all.

If you want to try these algorithms on your own data, you can use the SPMF data mining library in Java. SPMF is an open-source library that provides very efficient implementations of various data mining algorithms, including those for closed itemset mining. You can download SPMF from its website (http://www.philippe-fournier-viger.com/spmf/) and follow the documentation and examples to run the algorithms on your data. You can also use SPMF as a command line tool or integrate it with other Java applications, or program written in other programming languages such as C#, Python and R (see the website).

A video

If you are interested by this topic, you can also watch a 50 minute video where I explain in more details what are closed itemsets and their properties, but also maximal itemsets and generator itemsets. This video is part of my free pattern mining course, and you can watch it here: Maximal, Closed and Generator itemsets or from my youtube channel here.

Hope that this blog post has been interesting 🙂

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

This entry was posted in Data Mining, Data science, Pattern Mining and tagged , , , , , , , , , , . Bookmark the permalink.

5 Responses to What is a Closed Itemset and Why is it Useful?

  1. Pingback: What are Generator Itemsets? | The Data Mining Blog

  2. Pingback: What is a maximal itemset? | The Data Mining Blog

  3. Pingback: Free Pattern Mining Course | The Data Blog

  4. Pingback: SPMF 2.48 | The Data Blog

  5. Pingback: The SPMF data mining library v.2.40 is released! | The Data Blog

Leave a Reply

Your email address will not be published. Required fields are marked *