An Introduction to High-Utility Itemset Mining

In this blog post, I will give an introduction about a popular problem in data mining, which is called “high-utility itemset mining” or more generally utility mining. I  will give an overview of this problem, explains why it is interesting, and provide source code of Java implementations of the state-of-the-art algorithms for this problem, and datasets.

Frequent itemset mining

The problem of high-utility itemset mining is an extension of the problem of frequent pattern mining. Frequent pattern mining is a popular problem in data mining, which consists in finding frequent patterns in transaction databases. Let me describe first the problem of frequent itemset mining

Consider the following database. It is a transaction database.  A transaction database is a database containing a set of transactions made by customers.  A transaction is a set of items  bought by a customer. For example, in the following database, the first customer bought items “a”, “b”, “c”, “d” and “e”, while the second one bought items “a”, “b” and “e”.

a transaction database

A transaction database

The goal of frequent itemset mining is to find frequent itemsets. Many popular algorithms have been proposed for this problem such as Apriori, FPGrowth, LCM, Eclat, etc.  These algorithms takes as input a transaction database and a parameter “minsup” called the minimum support threshold.  These algorithms then return all set of items (itemsets) that appears in at least minsup transactions. For example, if we set minsup = 2, in our example, we would find several such itemsets (called frequent itemsets) such as the following:

some frequent itemsets

some frequent itemsets

For example, consider the itemset {b,d,e}. It is said to have a support of 3 because it appears in three transactions, and it is said to be frequent because the support of {b,d,e} is no less than minsup.

Frequent itemset mining has some important limitations

The problem of frequent itemset mining is popular. But it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account. Thus, an item may only appear once or zero time in a transaction.  Thus, if a customer has bought five breads, ten breads or twenty breads, it is viewed as the same.

A second important limitation is that all items are viewed as having the same importance, utility of weight. For example, if a customer buys a very expensive bottle of wine or a cheap piece of bread, it is viewed as being equally important.

Thus, frequent pattern mining may find many frequent patterns that are not interesting. For example, one may find that {bread, milk} is a frequent pattern. However, from a business perspective, this pattern may be uninteresting because it does not generate much profit. Moreover, frequent pattern mining algorithms may miss the rare patterns that generate a high profit such as perhaps {caviar, wine}.

High-utility itemset mining

To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In this problem, a transaction database contains transactions where purchase quantities are taken into account as well as the unit profit of each item. For example, consider the following transaction database.

transaction database with quantities

a transaction database with quantities and unit profit information for items

Consider transaction T3.  It indicates that the corresponding customer has bought two units of item “a”, six unit of item “c”, and two units of item “e”.  Now look at the table on the right. This table indicates the unit profit of each item. For example, the unit profit of items “a”, “b”, “c”, “d” and “e”  are respectively 5$, 2$, 1$, 2$ and 3$. This means for example, that each unit of “a” that is sold generates a profit of 5$.

The problem of high-utility itemset mining is to find the itemsets (group of items) that generate a high profit in a database, when they are sold together.  The user has to provide a value for a threshold called “minutil” (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is the itemsets that generates at least “minutil” profit.  For example, consider that “minutil” is set to 25 $ by the user. The result of a high utility itemset mining algorithm would be the following.

high-utility itemsets

high-utility itemsets

For example, consider the itemset {b,d}. It is considered to be a high-utility itemset, because it has a utility of 40$ (generates a profit of 40$), which is no less than the minutil threshold that has been set to 25$ by the user. Now, let’s look into more detail about how the utility (profit) of an itemset is calculated. In general, the utility of an itemset in a transaction is the quantity of each item from the itemset multiplied by their unit profit.  For example, consider the figure below. The profit of {a,e} in transaction T0 is   1 x 5 + 1 x 3 = 8 $.  Similarly, the  profit of {a,e} in transaction T3 is   2 x 5 + 2 x 3 = 16 $. Now, the utility of an itemset in the whole database is the sum of its utility in all transactions where it appears. Thus, for {a,e}, its utility is the sum of 8$ + 16 $ = 24$ because it appears only in transactions T0 and transaction T3.

utility of itemset {a,e}

illustration of how to calculate the utility of itemset {a,e}

Why the problem of high-utility itemset mining is interesting?

The problem of high-utility itemset mining is quite interesting for the following reasons.

First, it may be more interesting from a practical perspective to discover itemsets that generate a high profit in customer transactions than those that are bought frequently.

Second, from a research perspective, the problem of high-utility itemset mining is more challenging. In frequent itemset mining, there is a well-known property of the frequency (support) of itemsets that states that given an itemset, all its supersets must have a support that is lower or equal. This is often called the “Apriori property” or “anti-monotonicity” property and is very powerful to prune the search space because if an itemset is infrequent then we know that all its supersets are also infrequent and may  be pruned. In high-utility itemset mining there is no such property.  Thus given an itemset, the utility of its supersets may be higher, lower or the same. For example, in the previous example, the utility of itemsets {a}, {a,e} and {a,b,c} are respectively 20 $, 24$ and 16$.

In this blog post, I will not go into the details of how the high-utility itemset mining algorithms work. But a key idea is to use upper-bounds on the utility of itemsets that restore the anti-monotonicity propety to  be able to prune the search space.

If you want a very detailed survey of high utility itemset mining (2019) that I published on this topic. Also, you may have a look at the book of high utility pattern mining (2019).

Open-source implementations and datasets

There exists several algorithms for high-utility itemset mining that have been proposed over the year. Java implementations of the state-of-the art algorithms are currently offered in my open-source data mining library named SPMF. It offers for example, source-code of Two-Phase (2005), UPGrowth (2011), HUI-Miner (2012) and FHM (2014). To our knowledge, FHM is one of the fastest algorithm for this problem. It was shown to be up to six times faster than HUI-Miner, which was shown to be up to 100 times faster than UPGrowth, which was shown to be up to 1000 times faster than Two-Phase. You can try FHM and the other above algorithms by going to the SPMF website. On the website, you will find instructions about how to run algorithms and some datasets on the dataset page.  Update: Recently, the EFIM algorithm (2015) was proposed and was shown to outperform FHM by up to 1000 times, and is also offered in SPMF.


Note that there also exists several variations of the problem of high-utility itemset mining, which I will not cover in this blog post.


In this blog post, I  gave an overview of the problem definition of high-utility itemset mining. Hope that you enjoyed it. I will try to keep posting blog posts such as this one about interesting data mining problems in the future. Hope that you enjoyed reading 😉

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

(Visited 4,739 times, 2 visits today)


An Introduction to High-Utility Itemset Mining — 149 Comments

  1. Dear Prof. Philippe,

    Thank you for your effort and time to write this post which provides a clear and informative introduction of HUIM. It is really useful for people who initially do research on this problem.

    Recently, many variations of frequent itemsets have been proposed such as high-utility itemsets, erasable itemsets, top-rank-k frequen itemsets, and so on. Do you think frequent itemset mining and its extension is still a hot and common topic with a high potential of publication in good journals/conferences?

    Dang Nguyen

    • Hi Dang,

      Glad you enjoy this blog post. I’m thinking about writting a Part 2 on utility mining in the future.

      I think that it depends. To publish in a good journal/conferences, novel ideas need to be shown. A work that is “incremental” (just combine two ideas in a straightforward way, will usually not be accepted in a good conference and journals. For example, I personally don’t like the problem of erasable itemset because it is not a challenging problem. I had implemented one algorithm for SPMF and it is basically the same thing as Apriori, whithout any challenge. I did not read the recent paper on that topic, but at least the one that I read was straigthforward. In my opinion, high utility mining is probably the best topic among those. It offers some new challenges because we need to use upper-bounds to solve the problem. Of course, it is still possible to do some incremental work on this topic which are not important. But there is also potential to create something new.

      Another possibility is to create some new problems related to pattern mining and show that they are useful and have some real applications (it would be better).

      So I would say, it depends more on what you can do on the topic than the topic itself and some topics are more challenging. Going for a more challenging topic is always better, because it raises challenges and at the same time force to think about new solutions.



  2. Dear Prof. Philippe,

    Thank you for sharing your opinion. I totally agree that publishing papers in high quality journals/conferences often requires novel ideas and significant contributions. That is also one of my goals during my PhD tenure ahead 🙂

    Recently, I am working on the problem of frequent subgraph mining. It would be great if you post an overview of this topic.

    Looking forward to reading the part 2 and hope to see you one day 🙂

    Dang Nguyen

    • Hi Dang,

      You are welcome. Thanks for reading and commenting on the blog 😉

      Subgraph mining looks like a good topic. Graph are a more complicated structure than transactions so there should be more challenges than something more simple like itemset mining. Also, there are less work on subgraph mining in my opinion than itemset mining, so many problems have not been solved for graphs.

      I’m not an expert on graph mining. I have only read a few papers on this topic.

      If you are in Vietnam and want to meet me, I will be at PAKDD 2015 in Ho Chi Minh City for a week in May. Are you in that city? or will you attend this conference?



      • hello sir ,
        I’m little bit confused because i want to do my M-Tech final year project in data mining so as u said high utility item set mining is better to go with or subgraph mining is good what is the complexity of each of the topic and which topic will be welcome by the experts

  3. Dear Sir,
    Can u give me some challenging problems in high utility itemset mining/ high utility sequential pattern mining. Actually I inspired by your research in high utility itemset mining.So I have chosen HUI as my research topic, but now I am unable to define problem statement in it.So Plz Sir can u help me to identify challenging problem in HUI

    • Hi, There is a lot of possible problems on this topics. You can actually combine high utility itemset mining with any other topic from pattern mining to create a new topic. For example, authors have previously combined high utility itemset mining with top-k pattern mining to do top-k high utility pattern mining. This is an example. But you can combine any topic with high utility pattern mining to create a new topic.


    • It would be a little bit long to explain. I would need to write another blog post on this topic but I don’t have time now. Maybe i’ll do it later.

      The main idea is that unlike the support, the utility is not anti-monotonic. Thus, the supersets of an itemset may have a utility that is the same, lower or higher. For this reason, the utility cannot be directly used to prune the search space. To avoid this problems, high utility itemset mining algorithms introduces some upper-bounds. For example, if you read the Two-Phase paper, you will see that they are using the TWU upper bound. The TWU is a measure that is always equal or greater than the utility. This is why it is called an upper-bound. Why is it interesting? Because the TWU is anti-monotonic. Thus it can be used to prune the search space. Besides, the TWU, better upper-bounds have been introduces. For example, in HUI-Miner and FHM, the “remaining utility” is used to calculate an upper-bound


      • Thank u so much for your reply sir, which helps me a lot. Sir, I need some more information about HUI like combinations of HUI. Can you give details about it. I need a specific problem related to it.Plz help me Sir

        • To find topics that are combinations, you would need to search for some papers on pattern mining such as itemset mining and see what has not been done in high utility mining. I cannot really do that for you because it takes time, I’m busy, and actually I can only do that for my own students.

  4. Dear Sir,

    Thank you for such an informative blog. I am doing my research in utility mining. Your blog and your SPMF is really helpful. Thanks for your wonderful contribution.
    I also request you to post more blogs related to negative utility, on shelf utility, incremental utility etc.
    Sir, Can you list any other tools to calculate high utility itemset.


    • Hi, Thanks for reading the blog and commenting. Glad you like it and you are using SPMF. Ok. I will try to write more about high utility pattern mining in the future. As you have seen, I have worked quite a lot on this topic in the recent years, so yes, I could write more on this topic. Now, I’m a little bit busy, but in the next few weeks I might have time for putting more content on this topic. I will remember that request.

      Best regards

  5. Dear authors, anyone able to give me suggestion for finding complexity of algorithms used in utility mining, I.e. UP GROWTH, UPGROWTH + etc etc

  6. Dear Sir,
    How we differentiate utility mining algorithms like UP Growth, UP Growth + etc etc as in P,NP , NP complete or NP Hard?

    • If there is no complexity analysis provided in the paper, you may do your own analysis of the complexity of these algorithms. Personally, I don’t have time to do it. Some algorithms like UPGrowth are based on FPGrowth so you could check the complexity of FPGrowth which has likely been analyzed.
      TwoPhase is based on Apriori. You may get a discussion of the complexity of Apriori here:

      M. Hegland, “The Apriori Algorithm – A Tutorial”, Mathematics and Computation in Imaging Science and Information Processing, vol. 11, pp. 209-262, World
      Scientific Publishing Co., 2007.

      The EFIM algorithm which is the fastest high-utility itemset mining algorithm, do most of its operations in linear time for each pattern in the search space. Thus its complexity is roughly linear with the number of patterns in the search space.

  7. Pingback: Brief report about the 12th International Conference on Machine Learning and Data Mining conference (MLDM 2016) - The Data Mining & Research Blog

  8. Pingback: Brief report about the 16th Industrial Conference on Data mining 2016 (ICDM 2016) - The Data Mining & Research Blog

  9. Respected Sir,

    it is really nice post.
    thanks for clearing very broad topic with few words
    thank you so much sir.
    sir now i am going to start one project for mtech can you please guide me which topic should i prefer.
    I searched many IEEE paper regarding high utility mining.
    sir can you please suggest.

    Thanks and Regards

    • Hi,
      Glad you like this blog post and that it has been useful.

      I think that you could start from the source code of high utility mining algorithms in SPMF. You will then save a lot of time because many algorithms are already implemented and you also have the datasets for doing experiments on the website. Some topics ideas could be to modify the algorithm to add some novel ideas. For example, here are some possibilities:
      – adapting FHM or EFIM to discover rare high utility itemsets by combining ideas from rare itemset mining with high utility mining
      – adapting HUSRM for mining high utility sequential rule with a windows constraint by combining HUSRM with ideas similar to TRuleGrowth
      – adapting the EFIM algorithm for mining the periodic high utility itemsets
      – adapting HUSRM for top k high utility sequential rule mining
      There are a lot of possibilities. I have just listed a few examples.


      • Respected Sir,

        Sorry I am disturbing u.

        But I need your help, regarding explanation of TKU and TKO from your best paper Efficient algorithms for Mining Top-K High utility item-sets.

        I didn’t get the TKU and TKO algorithms actual flow.
        Can u please help me regarding this.

        Thanks and Regards

        • Hello,

          First, to give you some advices, I think that it is not really necessary to understand TKU, which is a quite complex algorithms, since faster and more simple algorithms have been proposed. TKO is faster than TKU. And last year, I participated to another paper where we proposed some algorithm that is faster than TKO:

          Duong, Q.-H., Liao, B., Fournier-Viger, P., Dam, T.-L. (2016). An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies. Knowledge-Based Systems (KBS), Elsevier, 104:106-122.

          So you can first perhaps read that paper. Maybe it is easier to read than TKO, and it is certainly easier to read than the TKU algorithm which is very complex.

          Also, that new algorithm from 2016 and TKO are both inspired by the HUI-Miner algorithms. So you may perhaps want to also read the HUI-Miner algorithm. This should help you the understand the other algorithms that extend HUI-Miner


  10. What are the applications of high utility itemset mining, how it is used in that applications?

    • Hi, The most important application of high utility itemset mining is to analyze customer transaction data. For example, if you have a database of customer transactions, where each transaction indicates the items bought in a store by a customer, you can find the itemsets that generates the most money using high utility itemset mining. This is interesting from a marketing perspective. For examle, if you know that the itemset {tomato,cheese,pita} generates a lot of money, the retail store manager may consider co-promoting these products or offering some discount to try to sell even more of these products. So basically, it can be used to analyze customer behavior by finding the most profitable itemsets. Then, this information can be used to take decision such as decisions related to marketing.

      High utility itemset mining could also be used in other domains. For example, let’s say that you have a factory where each part of some products has a cost. Then assume that some machines assemble the parts to build some products that are sold. You could use high utility itemset mining to find the sets of parts that are the most costly. Then, knowing these, you could try to replace these sets of parts by a single part that is cheaper to produce to reduce the cost of building products. This is just some example that I thought about right now. If you use your imagination you can transpose high utility mining to other problems. For example, in some papers about high utility itemset mining, people often cite that it has been also applied in bioinformatics and for analyzing web page click streams. I did not read these papers though. But you could find them.

      Hope this answers your question.

      • Thank you for your answers. I understood that concept sir. High utility itemset mining used in bioinformatics and web page click streams, i search some papers, but i do not get any description, how it is used, send me the related papers , so kindly send me any papers are there.

        • Some reference are: “mobile commerce (Tseng et al, 2013), click stream analysis (Ahmed et al, 2010; Thilagu et al, 2012), biomedicine (Liu et al, 2013) and cross-marketing (Ahmed et al, 2009; Tseng et al, 2013).”

          Tseng VS, Shie BE, Wu CW, Yu PS (2013) Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Transactions on knowledge and data engineering, 25(8):1772–1786

          Ahmed CF, Tanbeer SK, Jeong B (2010) Mining High Utility Web Access Sequences in Dynamic Web Log Data. Proceedings of the international conference on software engineering artificial intelligence networking and parallel/distributed computing, IEEE, London, UK, June 2010, pp. 76–81

          Thilagu M, Nadarajan R (2012) Efficiently Mining of Effective Web Traversal Patterns With Average Utility. Proceedings of the international conference on communication, computing, and security, CRC Press, Gurgaon, India, September 2016, pp. 444–451

          Liu Y, Cheng C, Tseng VS (2013) Mining Differential Top-k Co-expression Patterns from Time Course Comparative Gene Expression Datasets. BMC Bioinformatics, 14(230)

          Ahmed CF, Tanbeer SK, Jeong BS, Lee, YK (2009) Efficient tree structures for high-utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering 21(12):1708–1721

          I did not read these papers recently, so you may have a look by yourself to see the content.

  11. Sir i have one doubt, in utility mining, utility means importance or profitable to whom sir either user or vendor. I hope profitable to the vendor, it is correct or wrong sir. give me some example sir

    • Hello, Normally, it is understood as profitable to the vendor. A high utility itemset is an itemset that generates a lot of profit for the vendor (the retail store).

      But high utility itemset mining is a quite general problem. Basically you have a transaction database that contains transactions with quantities, and where items have unit profit values. You can always transpose this problem to other domains. As I answered you to your other message, it can also be applied to other domain. So, i think it would be possible to also apply it for finding patterns in terms of money saved by the customer, rather than the profit obtained by the vendor. In that case, instead of having profit value for each item, we would store the amount of money saved by a customer for each item bought. The database would hence be different.

  12. what is utility itemset mining with negative item values? give some examples sir. Where it is used?

    • In high utility itemset mining, it is assumed that all items have positive utility values. If we consider applying high utility itemset mining to analyze a customer transaction database, this means that all items must always generate a positive profit. So for example, if a retail store sell some apples, tomatoes, some milk, etc., or any other items, it must always generate a positive profit.

      But this assumption is not realistic in real-life retail store. For example, many big stores such as Wal-Mart and others will sell products at a loss. This means, that they will maybe sell you some milk and lose money (the profit will be negative). The reason is that by selling products at a loss, they will try to make you come to their store and hope that you will buy other more expensive products. If you apply traditional high-utility itemset mining algorithms on a database that contains negative unit profits, it was shown that these algorithm can generate an incorrect result. Thus, some algorithms such as FHN have been designed to discover high utility itemsets in databases which contain positive and negative profit values.

      The link for downloading the conference paper about FHN is here:

      Besides, you can find the source code here:

      And if you want more information, send me an e-mail at philfv8 AT and I will send you a copy of the longer journal paper about FHN that will be published in the Knowledge-Based Systems journal (about 30 pages).

  13. Hai sir
    It is some what difficult for me to understand FHN, please provide me some more information to understand FHN(High utility itemset mining with negative item values) with examples. I sent a mail to you sir.

    • A real dataset is some data that come from the real-life. For example, it can be some data from a retail store.

      A synthetic dataset is some fake data. The dataset has been generated just for the purpose of comparing algorithms. Using synthetic dataset is useful for comparing algorithms because you can increase the size of the dataset for example to see how they perform when the size increase or other characteristics are changed.

      A sparse dataset is a dataset where entries (records) are not very similar with each other. For example, let’s say that you have some data from a retail store and few customers buy the same thing.

      A dense dataset is a dataset where entries (records) are highly similar. For example, let’s say that you have data from a retail store and that everybody buy almost the same thing all the time.

      Typically, when comparing the performance of algorithms (especially for itemset mining and other pattern mining problems), researchers will use both dense and sparse datasets.

  14. how can you obtain datasets for high utility itemset mining with or with out negative values. In spmf, i am not getting full description of the datasets with attribute values, where we are getting datasets with full description such as records,attributes.

    • If you want real shopping data with attribute values, you could look at the Foodmart and Chainstore datasets. Chainstore comes from a database provided with MySQL 2000, while Foodmart comes from the NU MineBench software. On the SPMF website they do not contains attribute values because we have converted these databases to text file. But you could download the original databases and check the attribute values.

      Note that these two datasets have utility values, but may not have negative values, though.

      As for how to obtain MySQL 2000 and NUMinebench you can look on Google. I don’t have any information about how to use them and how they were converted to text file.

  15. What is the difference between on-shelf utility mining and periodic utility mining? both are same or not. In both we are considering time periods.

    • No, they are different. If you read the papers you will find the problem definition of both of these papers.

      Periodic: try to find something that appears periodically: e.g. a customer buy bread every week

      On shelf: the goal is different. The idea is to find what makes the most money as in high utility itemset mining. But it consider the fact that some items are for example only sold during the summer. Thus it consider the shelf time of items when calculating the utility. It consider time but the itemsets found are not necessarily periodic. That is not the goal.

  16. Dear Sir,

    First of all I thank you for your formative blog. I am doing my research in utility mining.
    I would like to know about post utility mining methods i.e after retrieving high utility itemsets. This would help me in my research work. Thanks in advance.

    • Hello,

      Glad you like the blog 😉 I don’t know much work on post utility mining, so I think that you have chosen a good topic. It could be about how to vizualize the itemsets. If you search a little bit about itemset mining in general, association rules and other kinds of patterns, you will see that there are some works about how to visualize the patterns found. This is something that you could do for high utility itemsets. Besides that, it is certainly possible to use the itemset found in other ways for some specific application domains. For example, once the high utility itemsets are found, they could be used for classification, recommendation, etc. But I did not see much work on that. So If you work on this, I think you have many possibilities.


      • Thank you for your suggestion sir. I planed to choose classification based on association rules. Could you please suggest me, what is the purpose of classification based on association rules? and Why we need to classify the rules.

        • Classification is when you have a set of instances and you want to determine to which class they belong to.

          For example, if you have data about some bank customers, you may want to predict if some new customers will belong to the class “good customer” or “bad customer that go bankrupt”.

          In classification by association rules, you do not classify the rules. You are classifying the instances.

          For example, if you have a database about bank customers where you have several attributes such as age, gender, income / year, etc., you could extract some association rules from this database that are specific to the class “bad customer” or “good customer”

          Then you could use these rules to classify some new instances (customers) as either good or bad customers.

          This is the main idea about classification in data mining. We do not classify the rules. But we use the rules for performing classification. There are a lot of other classification models by the way such as decision trees, neural networks, etc.

  17. Dear sir,
    I am not good at programming but highly interested in utility mining. Could you please suggest some areas that doesn’t need implementation.

    • Hello,

      If you do not want to do programming, then the only thing that you could to is apply the existing algorithms on some data to get the result. You can try different algorithms and see what kind of results they generate for your data.

      This can be interesting from an application perspective.

      However, if you are looking for a project for a master degree in computer science, I would not consider this as a good project because a master degree in computer science should involve programming.


  18. Hello,

    Thanks for this blog post.
    Just wondering the scalability of FHM algorithm : is it fine to run on 100millions transaction dataset at reasonable memory/computation time?


    • Hello,

      It would be better to use the EFIM algorithm, which is faster than FHM and uses less memory.

      I am not sure that it would support 100 millions transactions. But you could try with a subset. For example, you might use a week long of transactions or a month instead of using the full dataset, or you could use sampling if needed to reduce the size of the dataset.

      Besides, the performance of the algorithm actually depends on the “mininum utility” parameter. The lower the parameter is set, the more difficult the task becomes. For this type of problem, if you had 100 VERY long transactions, if you set the parameter very low, it may never be able to terminate because of the extremely large search space, but it may also run on millions on transactions without problems if the parameter is higher. So, what I mean is that it depends on the parameter. The search space can increase exponentially when the parameter is lower.

      In experiments described in the research paper, we run these algorithms on about 1 million transactions from a retail store. So I am confident that it could run on a least 1 million transactions.

      By the way, if you want to use this in a live environment, there is many way that the algorithm could be tweaked to improve performances depending on your need. For example, if you add some additional constraints such as a length constraint, the algorithm will become much faster. Besides, the algorithm could be rather easily parallelized if needed.

      Best regards,

  19. Dear sir,
    Could you please suggest applications of utility mining in banking industry. Which property can be taken as utility measure? Thanks in advance.

  20. Sir,
    can you explain about Transaction Weighted Utilization (TWU) and it calculation too I need sir.

    • The TWU calculation is simple. For an itemset X, the TWU of X is the sum of the profit of all transactions containing X.

      For example, if you consider the itemset {b,c,d}. This itemset appears in the transactions T0 and T1 in my example. Hence, the TWU of {b,c,d} will be the sum of the profit of all items in transactions T0 and T1.

      Thus, TWU({b,c,d} = profit of all items in T0 + profit of all items in T1
      = (1 x 5 + 5 x 2 + 1 x 1 + 3 x 2 + 1 x 3) + (4 x 2 + 3 x 1 + 3 x 2 + 1 x 3)

  21. Dear sir,

    I have one doubt in FHM. In the paper FHM you have mentioned pruning strategies of the algorithm for various dataset as s 18% to 91%, 87%, 87 % and 31% to 95% for the Chainstore, BMS, Kosarak and retail datasets. Could you please say how you calculated it.

    Thank you sir.

    • Hello,

      It is simple to calculate this.

      Consider one of the datasets.
      Let’s say that for the lowest minutil values, HUI-Miner has 1000 candidates and FHM has 250 candidates
      Let’s say that for the highest minutil values, HUI-Miner has 100 candidates and FHM has 90 candidates.

      By a simple calculation, you can find that the reduction for the lowest minutil value is 75%, and that the reduction for the highest minutil value is 10 %.

      Thus, I would write that the number of candidates was reduced from 10 % to 75%.

      That is the main idea.

      In this example, I just used any numbers. But in the experiments of the FHM paper, I have collected these numbers from the execution of the algorithm.

      Hope this is clear

    • Hi, I would like to share the code. But I am not the main authors of these papers.
      For TKO, you would need to send an e-mail to Cheng-Wei Wu to ask for the code.
      For kHMC, you would need to send an e-mail to Huy Duong to ask for the code.

      Best regards,

      • Thanks sir for your nice suggestion. Sir could you please send me these two negative utility datasets named T10I4D100K and BMS-POS. These datasets are used in one of your work titled
        “FHN An efficient algorithm for mining high-utility itemsets with negative unit profits”.

  22. Dear sir,
    Could you please explain about performance measures of high utility itemsets, other than execution time, memory space, no. of items,etc..
    Also suggest how to calculate accuracy of itemsets retrieved.
    Thanks in advance.

    • In general, in pattern mining, the goal is to discover some patterns. You can use these patterns to understand the data, and you could also use these patterns to take business decisions, etc.

      The typical measures to evaluate a pattern mining algorithm is speed and memory usage for various parameter values and database size (scalability), and number of patterns. We want to know how fast you can get the patterns and how much memory will be used for various parameters, and how the algorithm will behave for different dataset size.

      Accuracy is a measure typically used in classification to check how many times you have correctly made predictions. In pattern mining, the goal is not to classify or make predictions. Thus, the concept of accuracy does not exist in high utility itemset mining. But if you find a way to use the pattern to make predictions, then you could calculate the accuracy.

  23. Hello Sir
    I did my Masters thesis 3 years ago on High Utility Rare Itemset Mining and finished of with a algorithm that was parallelizable. But at that point of time i was not exposed to parallel programming environments so i just concluded with a serial implementation. Now i have hands on experience of parallel programming models like OpenMP,MPI and CUDA. I was wondering whether i can pursue my doctoral thesis by employing one or more of these parallel computing models for high utility itemset mining? My concern is regarding the worth of this idea. Seek your frank comments Sir .

  24. This is one of the best explanations I’ve ever read. I congratulate you for the great stuff …
    Thank you
    Mahmoud Darvish
    PhD student
    Free University of Iran-Semnan

  25. Dear Sir,
    could you please send me the synthetic dataset for High-Utility Pattern Mining. I am unble to generate these type of datasets as suggested in many research papers.

    • You can generate them using the SPMF software. It offers the synthetic dataset generator.

      After installing the software, in the documentation, you should check examples:

      Example 141 : Generating a synthetic sequence database
      Example 144 : Generating synthetic utility values for a transaction database without utility values

      To generate a new dataset, you should first do Example 141. It will give a transaction database. Then you should do Example 144 to add the utility values to the dataset.

    • Hello, the main application is to analyze customer transactions. With TKU/TKO, you need to set a parameter k. For example, let say that we set k = 100. These algorithms will find the k itemsets (sets of items) that generate the highest profit. For example, it could find that {bread, milk} is one of the most profitable set of items purchased by customers. Having this information, a business manager can take business decisions such as promoting these items to generate more money.

      This is the main application. However, the problem of high utility itemset mining is quite general. A transaction is basically a set of symbols, and utility values are some numbers that are asociated to these symbols. Thus it is easy to apply these algorithms in many other domains. For example, if you have e-learning data, each transaction could be the course taken by a student, and the utility values could be the score that the student received for the course. Then,
      you could analyze this data to find the groups of courses that the student have the highest score (if it would make sense… I don’t know.) But this is just an example. You could do the same with hospital data, etc. So there are many applications.

        • For hospital data, a transaction can be the medecine taken by a patient, and the utility values, the cost of the medicine. I’m not sure if it make sense. But it is just an example to show that it is easy to change the problem to another domain.

          Another example, it could be web log data. In that case a transaction is the webpages viewed by some user, and the utility values are the time spend on each page

  26. Respected Sir,

    Thanks for your simple and neat explanation of utility mining.
    I want to work on post utility mining tasks like using utility item
    set for classification. Suppose I am taking my specific application domain as documents what are the possible utility measures?

    Thanks in advance

  27. Dear Sir,
    could you please describe what will be the profit value of mobile services if we have mobile service accessing sequences instead of transaction.

    Thanks in advance.

    • Hello,

      I do not have datasets that are large enough to be called “big data” for HUIM. Thus, you could always use synthetic datasets by generating them with a tool such as SPMF. Or another solution would be to discuss with real companies and ask them for their data, or convert some other big datasets available on the internet to the format required for HUIM. However, it is not easy to find such datasets. The most simple solution is to use synthetic datasets.

      By the way, when we talk about big data, we should not only think about the amount of data. Actually, in HUIM, the amount of data is not the most difficult part of the problem. The main problem in HUIM is the size of the search space. For example, if you have a database that contains only 1 transactions with 500 items and set minutil = 1, the search space will be 2^500, and the algorithm will never terminate. Thus, even with small data, HUIM is very challenging. And you could have a very large database in HUIM that is not challenging because the search space is small. So what I mean is that if you develop a cloud algorithm for HUIM, it could still be useful even on small data if you can speed up the discovery of patterns. You do not need to have a huge dataset to benefit from a cloud implementation of HUIM algorithms.


  28. Dear sir,
    Can you explain about post mining applications of HUIM? I mean, can we use high uility itemsets to produce classification rules? Will it be beneficial compared to normal FIM?

    • Hello,
      Good question. I think there are many things that could be done with high utility itemsets. It could indeed be used for classification. Or it could be used for other things perhaps like clustering. Or even, it would be possible to develop some visualization tools to visualize the high utility itemsets to support decision making. I think there are many possibilities since the problem of itemset mining can be applied to many domains as the data format is very general. Would high utility itemsets perform better than FIM ? I think that it depends on the application. If the information about the “utility” is important for an application, then I would expect that it could perform better but if the information about the utility is not important for an application, it could perhaps perform worse…. The only way to really know if HUIM is more useful than FIM for a specific application is to try it I guess 😉

  29. Dear Sir,

    I followed this post on high utility. Well explained. Thank you for that example.
    Sir, as high utility is extension of pattern (sequential) mining , do we have any data constraints in high utility too. Like in FIM(Frequent Item Mining), we have constraints like data, measures and models etc. Similarly do we have any constraints same.

    I appreciate your kind reply on this sir. Thank you in advance.


    • Hello, Thanks for reading the blog.

      Yes, just like in frequent itemset mining, we can add constraints to the problem of high utility itemset mining to create some new problems and design new algorithms.

      For example, in a recent papers of mine about the FCHM algorithm, I added the constraint that itemsets must be 1) high utility but must also be (2) correlated using the bond measure. Thus, instead of only using the utility measure, I have added a new constraint that each high utility itemset must have a bond that is greater than some minbond threshold. Thus, this problem has now two constraints : one using the utility measure and one using the bond measure.

      Another example, in a recent papers of mine about the PHM algorithm, I added the constraint that itemsets must be 1) high utility but also must be 2) periodic. Thus, again I extended the concept of high utility patterns with more constraints to propose some new algorithms and write a paper about that.

      In general, adding constraints is a good way of creating new research problems to write papers. You can combine two topics such as Periodic pattern mining + High utility itemset mining to generate a new topic Periodic High utility itemset mining (as I did with the PHM algorithm). Similarly you can combine other topics to generate new topics. But of course, not all topics are interesting. Some topics are combined and are not interesting or difficult. It is better to generate some topics that look difficult or have some novelty or seems useful.

      • Dear Sir,

        Thank you so much for your kind reply. Your answer helped me to get new topic in HUIM area. Also helped me to add some constraints in my on going paper. Really thankful to you.


        • Sir,

          I am thinking to apply HUIM on banking data or health care data. Can it be done? Please give your insights on this.

          Thank you.


          • Hello,

            Yes, HUIM can be applied in many domains. If it is banking data, then the “utility” can be the money. Or the utility could be something else that is measurable with numbers such as the waiting time, the number of transactions, the amount of money that is withdrawn, or anything like that. Basically, you need to think about:
            – what will the items represent for your application?
            – what the utility will represent for your application? (time, money, etc.)
            – what a transaction represents for your application?

            I think there are many ways of applying HUIM in such domains. It depends what you want to find.


  30. Hello,

    Thanks for this blog post.
    It was really informative Sir.
    I’m a under graduate student doing project in high utility pattern mining.
    Could you suggest me a algorithm suitable for feature selection and ranking Sir.

  31. Dear Sir,

    I am Sachin Sharma. I am giving my efforts to generate rare itemsets in data Mining. I have read the multiple papers related to this and found these can be generated by using multiple minimum support threshold. Apart from this, Is there any approach/ method to identify rare itemsets? Can you suggest some latest approach or methodology?

  32. Sir, we are working on high utility sequential pattern mining with negative utility value. Sir could you please send me the datasets for mining high utility sequential pattern mining with negative utility value. Is there any way to generate the said type of datasets. Please help us.

    • Hello,

      This is an interesting topic. There are many sequence datasets on the SPMF website (see the dataset section). Then, you could write a simple program to read these datasets and then randomly generate the utility values with positive and negative values.

      In SPMF, there is some classes to randomly generate datasets and randomly generate utility values. However, the tool for randomly generating utility values in SPMF is designed for transaction databases used in itemset mining. You could check the code and modify those for sequence databases. It is something that could be done very quickly depending on your programming ability.


  33. Sir, Are the “Retail” and “Chainstore” datasets used in bigdata context as the search space explodes exponentially. (This may correspond to the Big Data as the feature space (Item Space) is huge and searching this huge feature space (Item Space) requires large computing power.)

    • Hello, Yes, I agree with you about that. In itemset mining, the size of the search space is a bigger problem than the amount of data. You can have very small datasets that have a huge search space, and large datasets with a small search space. It depends on the type of data and how the user set the parameters of the algorithm. So yes, in some case, a large amount of computing power is needed. And also, in some case, maybe we just want to compute the answer more quickly than using a single computer. So we may use the cloud technology or GPU to solve an itemset mining problem even if the data is small.

      So, if you are writing a paper, you could explain that to justify why you are using the cloud, parallel or distriubted algorithms, or the GPU to find itemsets.

      But still, the term “big data” is generally used to refer to a lot of data. So, if you are using “retail” and “chainstore” you cannot call these datasets big data. But you could say that you use big data technologies to explore a big search space and reduce the time to find a solution.

      • Sir, if you change the format of the original dataset (Which is rather a compressed format) like “Retail” or “Chainstore” into 0/1 format, the dataset itself is taking 100 GB for the “Chainstore” and 2.9 GB for the “Retail” datasets. So, then can we call them Big Data.

        • Hello, That is an interesting observation that I never thought about 😉 But still, although there is maybe no clear definition of big data, many people say that big data is an amount of data that cannot be stored or processed on a single computer. 100 GB can be stored on a modern day computer. Besides, if you are writing a paper, if you send the paper to a good conference or journal, the reviewer may be familiar with itemset mining and may know that these datasets are actually quite small in their original form.

  34. Dear Prof. Viger,

    Could you pls let me know explicitly whether you used big data analytics platform (Hadoop or Spark framework) to mine High utility itemsets from Retail and Chainstore data sets?

    Appreciate your quick response.

    Thx n rgds,

  35. Bonjour Mr Fournier Philipe;

    Je suis nouvelle dans le dataming et surtout dans l’extraction des frequents item.

    Es que c’est possible de me fournir des informations sur ces mesures:
    1) Fréquence d’un motif
    2) Superficie: → () * ()
    3) La moyenne → (.) + (.) / (Étant donné une fonction val: I → R +, on l’étend à un motif X et on note X.val le multiset {val ( i) | i ∈ X} Ce type de fonction est utilisé avec la somme primitive habituelle de type SQL, min et max.Par exemple, sum (X.val) est la somme de val pour chaque élément de X.)
    4)Lien → () / ()

    Que veut dire chacun de ces mesures dans les items? et es qu’il y a d’autre mesures pour de bon items?? dans le dataming toujours.

    Merci d’avance

    • Bonjour,
      La fréquence d’un motif cela signifie généralement le nombre de fois qu’il apparaît dans la base de données. S’il apparaît 5 fois, alors sa fréquence est 5.
      La superficie et la moyenne et lien, je ne suis pas certain. Je pense que cela fait référence à un article spécifique d’un chercheur en particulier. Je ne sais pas de quel article il s’agit.
      Il y a beaucoup de mesures utilisés pour chercher des motifs comme des itemsets, règles d’associations et autres dans les bases de données comme le lift, confidence, support, bond, all-confidence, et plusieurs autres. En fait, il y en a des dizaines. Donc cela dépend de l’article que tu lis.

      Bonne soirée,

  36. Dear Mr Fournier Philipe!
    I am studing about data mining. I read Srikumar Krishnamoorthy ‘s newspaper : “HMiner: Efficiently mining high utility itemsets”.
    Is HMiner better than EFIM?
    Do you have the source code of the Hminer algorithm ?
    (Hminer:Efficiently mining high utility itemsets – Srikumar Krishnamoorthy)
    I’m looking forward to receive your respond. Thank you very much!

    • Hello, Yes I quickly look at that paper a while ago but I do not plan to implement it because there are already many algorithms for that problem in SPMF, and just implementing another algorithm because it seems a little bit faster for some datasets and very low thresholds, i am not very interested. I think that it is more useful to design algorithms for new pattern mining problem than to just implement one more algorithm for the same problem. I think that paper has some new ideas, but the improvement is not that big compared to improvements brought by other algorithms. For example, when UPGrowth was published in 2011, it was about 100 times faster than Two-Phase. When HUI-Miner was published it was maybe 100 times faster, and when EFIM was published it was up to 1000 times faster. But in that new paper, I think that the improvement is not that big when there is one. There are also a few other papers that claim to have made some improvements in the last year but also the improvement is not that big in my opinion. So I prefer to focus on new pattern mining problems that bring value to the user rather than just working on another faster algorithm. I think it is more worthwile but this is just my opinion 😉 Also big conferences like KDD, ICDM etc and top journals generally prefer papers on new problems rather than just faster algorithm, so I focus my research on new problems and prefer to implement algorithms for new problems or at least variations of the problem.

  37. Hi sir,
    What differences are between constraint-based pattern mining and high utility pattern mining ?
    Thank you so much!

  38. Sir, I am thinking about classification on High utility sequential itemsets mining. Could you please suggest me, what are the applications and how to apply classification on high utility sequential itemsets mining.

  39. Dear Sir,

    First, I wish to thank you for this post, bcos it helps me a lot to understand HUIM better.
    And lastly, please I will be glad if I can get some start up links on how to go about improving such kind of algorithms. I am a master student in Computer Science.

    Thanks once again and we need more of this, please.
    Best regards,

    Mustapha Giro

  40. Pingback: Report about the 2018 International Workshop on Mining of Massive Data and IoT - The Data Mining Blog

  41. Pingback: Upcoming book: High Utility Itemset Mining: Theory, Algorithms and Applications - The Data Mining Blog

  42. sir,
    Could you please describe the process of mining, both the frequent and utility patterns.

  43. Sir, where can I find the 21557 transaction Foodmart high utility dataset. Can you please provide the link for the 21557 transaction Foodmart high utility dataset or else email me the dataset.
    Also can you provide me the exact Accidents_10% high utility itemset dataset.
    If possible Sir, can you please email me both the datasets.

    Thanks and Regards
    Gutha Jaya Krishna

    • Hi, The datasets that i have are on the SPMF website in the datasets section. It is possible other people have created a 10% accidents dataset or a Foodmart dataset with 21557 transactions but I don’t know which paper it is. Maybe the best solution is to ask the authors directly. It is also possible that 21557 is an error in the paper if it does not match with the dataset on SPMF. Or there is another reason. If it is my paper, then I don’t remember because I participate to so many papers.

      Thanks for reading the blog.

      Best regards,


  44. Sir, in this paper reference below, 21557 transaction Foodmart and Accidents_10% datasets are used in which you are also one of the co-author. Sir, in any case can you please provide me the two high utility datasets links or provide them in your SPMF datasets page or personally email them to me if you find them.

    Jerry Chun-Wei Lin, Lu Yang, Philippe Fournier-Viger, Tzung-Pei Hong, and Miroslav Voznak, “A Binary PSO Approach to Mine High-Utility Itemsets,” Soft Computing, pp: 1-19, 2016.

    Thanks & Regards
    Gutha Jaya Krishna

    • Hi, I see. In that paper, yes, I am a co-author but I am not the corresponding author. Actually, it is my colleague Prof. Jerry Chun-Wei Lin who did that project with his student. For question about that paper where Prof. Lin is corresponding author, it is better to contact with Prof. Lin directly.

      Best regards,

  45. what is the difference between weighted item set and high utility item set.?
    both are the same or not.?

    • Hi, They are not the same. Weighted itemset is a special case of high utility itemset.
      In Weighted itemset mining, each item has a weight that indicates its relative importance. For example, if an item is a product in a retail store, then the weight of an item “apple” may be its unit profit.
      In High utility itemset mining, each item has a weight, but also items in transactions have quantities. Then, the utility of an item or itemset is calculated by multiplying the quantities by the weights.

      So, weighted itemset mining is a special case of high utility itemset mining where all purchase quantities are 0 or 1.

      Best regards,

  46. Pingback: A brief report about the Big Data Analytics 2019 conference (BDA 2019) - The Data Mining Blog

  47. Please Sir Help
    Identify one application domain of high utility itemset mining with its usage for customers and marketers.

  48. Differentiate between frequent pattern item set and high utility item set mining. identify one application domain of high utility item set mining with its usage for customers and marketers.

      • Respected Sir,
        Sir i Submitted my Answer but i am not satisfied from my Answer.
        Sir i need your help for future understanding.

        “application domain of high utility itemset mining with its usage for customers and marketers.”

  49. Please Sir Help
    I submitted my answer two months before but I’m not satisfied from my answer, please help me for better understanding Sir, I like your articles alot.

    “Differentiate between frequent pattern item set and high utility item set mining. identify one application domain of high utility item set mining with its usage for customers and marketers.”

    • Hi, An application could be product recommendation like on Amazon. If you look at some product A, then Amazon may show you some related products that you may like. If you use frequent itemset mining, the website would recommend products that area frequently purchased together, while if you use high utility itemsets, the website would recommend you products that may not be frequent but yield a high profit.
      I think that this may answer the question… but there are certainly other possible answers. It is kind of an open question…

Leave a Reply

Your email address will not be published. Required fields are marked *