In this blog post, I will give an introduction about a popular problem in data mining, which is called “high-utility itemset mining” or more generally utility mining. I will give an overview of this problem, explains why it is interesting, and provide source code of Java implementations of the state-of-the-art algorithms for this problem, and datasets.
Frequent itemset mining
The problem of high-utility itemset mining is an extension of the problem of frequent pattern mining. Frequent pattern mining is a popular problem in data mining, which consists in finding frequent patterns in transaction databases. Let me describe first the problem of frequent itemset mining
Consider the following database. It is a transaction database. A transaction database is a database containing a set of transactions made by customers. A transaction is a set of items bougth by the customers. For example, in the following database, the first customer bougth items “a”, “b”, “c”, “d” and “e”, while the second one bougth items “a”, “b” and “e”.
The goal of frequent itemset mining is to find frequent itemsets. Many popular algorithms have been proposed for this problem such as Apriori, FPGrowth, LCM, PrePost, FIN, Eclat, etc. These algorithms takes as input a transaction database and a parameter “minsup” called the minimum support threshold. These algorithms then return all set of items (itemsets) that appears in minsup transactions. For example, if we set minsup = 2, in our example, we would find several such itemsets (called frequent itemsets) such as the following:
For example, consider the itemset {b,d,e}. It is said to have a support of 3 because it appears in three transations, and it is said to be frequent because the support of {b,d,e} is no less than minsup.
Frequent itemset mining has some important limitations
The problem of frequent itemset mining is popular. But it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account. Thus, an item may only appear once or zero time in a transaction. Thus, if a customer has bougth five breads, ten breads or twenty breads, it is viewed as the same.
A second important limitation is that all items are viewed as having the same importance, utility of weight. For example, if a customer buys a very expensive bottle of wine or just a piece of bread, it is viewed as being equally important.
Thus, frequent pattern mining may find many frequent patterns that are not interesting. For example, one may find that {bread, milk} is a frequent pattern. However, from a business perspective, this pattern may be uninteresting because it does not generate much profit. Moreover, frequent pattern mining algorithms may miss the rare patterns that generate a high profit such as perhaps {caviar, wine}
High-utility itemset mining
To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In this problem, a transaction database contains transactions where purchase quantities are taken into account as well as the unit profit of each item. For example, consider the following transaction database.
Consider transaction T3. It indicates that the corresponding customer has bougth two units of item “a”, six unit of item “c”, and two units of item “e”. Now look at the table on the right. This table indicates the unit profit of each item. For example, the unit profit of items “a”, “b”, “c”, “d” and “e” are respectively 5$, 2$, 1$, 2$ and 3$. This means for example, that each unit of “a” that is sold generates a profit of 5$.
The problem of high-utility itemset mining is to find the itemsets (group of items) that generate a high profit in a database when they are sold together. The user has to provide a value for a threshold called “minutil” (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is those that generates at least “minutil” profit. For example, consider that “minutil” is set to 25 $ by the user. The result of a high utility itemset mining algorithm would be the following.
For example, consider the itemset {b,d}. It is considered to be a high-utility itemset, because it has a utility of 40$ (generates a profit of 40$), which is no less than the minutil threshold that has been set to 25$ by the user. Now, let’s look into more detail about how the utility (profit) of an itemset is calculated. In general, the utility of an itemset in a transaction is the quantity of each item from the itemset multiplied by their unit profit. For example, consider the figure below. The profit of {a,e} in transaction T0 is 1 x 5 + 1 x 3 = 8 $. Similarly, the profit of {a,e} in transaction T3 is 2 x 5 + 2 x 3 = 16 $. Now, the utility of an itemset in the whole database is the sum of its utility in all transactions where it appears. Thus, for {a,e}, its utility is the sum of 8$ + 16 $ = 24$ because it appears only in transactions T0 and transaction T3.
Why the problem of high-utility itemset mining is interesting?
The problem of high-utility itemset mining is quite interesting for the following reasons.
First, it may be more interesting from a practical perspective to discover itemsets that generate a high profit in customer transactions than those that are bougth frequently.
Second, from a research perspective, the problem of high-utility itemset mining is more challenging. In frequent itemset mining, there is a well-known property of the frequency (support) of itemsets that states that given an itemset, all its supersets must have a support that is lower or equal. This is often called the “Apriori property” or “anti-monotonicity” property and is very powerful to prune the search space because if an itemset is infrequent then we know that all its supersets are also infrequent and may be pruned. In high-utility itemset mining there is no such property. Thus given an itemset, the utility of its supersets may be higher, lower or the same. For example, in the previous example, the utility of itemsets {a}, {a,e} and {a,b,c} are respectively 20 $, 24$ and 16$.
In this blog post, I will not go into the details of how the high-utility itemset mining algorithms work. But a key idea is to use upper-bounds on the utility of itemsets that restore the anti-monotonicity propety to be able to prune the search space. This will be the topic of a future blog post.
Open-source implementations and datasets
There exists several algorithms for high-utility itemset mining that have been proposed over the year. Java implementations of the state-of-the art algorithms are currently offered in my open-source data mining library named SPMF. It offers for example, source-code of Two-Phase (2005), UPGrowth (2011), HUI-Miner (2012) and FHM (2014). To our knowledge, FHM is one of the fastest algorithm for this problem. It was shown to be up to six times faster than HUI-Miner, which was shown to be up to 100 times faster than UPGrowth, which was shown to be up to 1000 times faster than Two-Phase. You can try FHM and the other above algorithms by going to the SPMF website. On the website, you will find instructions about how to run algorithms and some datasets on the dataset page. Update: Recently, the EFIM algorithm was proposed and was shown to outperform FHM by up to 1000 times, and is also offered in SPMF.
SPMF also offers many other algorithms to identify other pattern types such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.
Note that there also exists several variations of the problem of high-utility itemset mining, which I will not cover in this blog post.
To know more: key papers, glossary, and video lectures
You may also be interested in reading my list of key research papers on high utility itemset mining and the glossary of high utility itemset mining, or watch my video lectures on this topic.
Conclusion
In this blog post, I tried to give an overview of the problem of high-utility itemset mining. Hope that you enjoyed it. I will try to keep posting blog posts such as this one about interesting data mining problems in the future. Hope that you enjoyed reading 😉
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
























