An Introduction to High-Utility Itemset Mining

In this blog post, I will give an introduction to a popular problem in data mining called “high-utility itemset mining”, or more generally utility mining. I will give an overview of this problem, explain why it is interesting, and point to source code of Java implementations of state-of-the-art algorithms for this problem, as well as datasets.

Frequent itemset mining

The problem of high-utility itemset mining is an extension of the problem of frequent pattern mining. Frequent pattern mining is a popular problem in data mining, which consists in finding frequent patterns in transaction databases. Let me first describe the problem of frequent itemset mining.

Consider the following database. It is a transaction database.  A transaction database is a database containing a set of transactions made by customers.  A transaction is a set of items  bought by a customer. For example, in the following database, the first customer bought items “a”, “b”, “c”, “d” and “e”, while the second one bought items “a”, “b” and “e”.

[Figure: A transaction database]

The goal of frequent itemset mining is to find frequent itemsets. Many popular algorithms have been proposed for this problem, such as Apriori, FPGrowth, LCM, Eclat, etc. These algorithms take as input a transaction database and a parameter “minsup” called the minimum support threshold. They then return all sets of items (itemsets) that appear in at least minsup transactions. For example, if we set minsup = 2 in our example, we would find several such itemsets (called frequent itemsets), such as the following:

[Figure: Some frequent itemsets]

For example, consider the itemset {b,d,e}. It is said to have a support of 3 because it appears in three transactions, and it is said to be frequent because the support of {b,d,e} is no less than minsup.
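
To make this definition concrete, here is a minimal Java sketch of how the support of an itemset can be counted by scanning a transaction database. This is just an illustration written for this post (not code from SPMF or any library), and the small database below only mimics the example figure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportExample {

    // Support of an itemset = number of transactions that contain all of its items
    static int support(List<Set<String>> database, Set<String> itemset) {
        int count = 0;
        for (Set<String> transaction : database) {
            if (transaction.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A hypothetical transaction database (not exactly the one in the figure)
        List<Set<String>> database = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "b", "c", "d", "e")),
                new HashSet<>(Arrays.asList("a", "b", "e")),
                new HashSet<>(Arrays.asList("b", "c", "d", "e")),
                new HashSet<>(Arrays.asList("b", "d", "e")));

        Set<String> itemset = new HashSet<>(Arrays.asList("b", "d", "e"));
        int minsup = 2;
        int sup = support(database, itemset);
        System.out.println("support({b,d,e}) = " + sup
                + (sup >= minsup ? " -> frequent" : " -> infrequent"));
    }
}
```

Real algorithms such as Apriori or FPGrowth avoid performing this naive scan for every candidate itemset, but the support that they compute is exactly this count.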

Frequent itemset mining has some important limitations

The problem of frequent itemset mining is popular. But it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account: an item can only appear once or not at all in a transaction. Thus, whether a customer has bought five, ten or twenty loaves of bread, it is viewed as the same.

A second important limitation is that all items are viewed as having the same importance, utility or weight. For example, whether a customer buys a very expensive bottle of wine or a cheap piece of bread, the purchase is viewed as equally important.

Thus, frequent pattern mining may find many frequent patterns that are not interesting. For example, one may find that {bread, milk} is a frequent pattern. However, from a business perspective, this pattern may be uninteresting because it does not generate much profit. Moreover, frequent pattern mining algorithms may miss the rare patterns that generate a high profit such as perhaps {caviar, wine}.

High-utility itemset mining

To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In this problem, a transaction database contains transactions where purchase quantities are taken into account as well as the unit profit of each item. For example, consider the following transaction database.

[Figure: A transaction database with quantities and unit profit information for items]

Consider transaction T3. It indicates that the corresponding customer has bought two units of item “a”, six units of item “c”, and two units of item “e”. Now look at the table on the right. This table indicates the unit profit of each item. For example, the unit profits of items “a”, “b”, “c”, “d” and “e” are respectively 5$, 2$, 1$, 2$ and 3$. This means, for example, that each unit of “a” that is sold generates a profit of 5$.

The problem of high-utility itemset mining is to find the itemsets (groups of items) that generate a high profit in a database when they are sold together. The user has to provide a value for a threshold called “minutil” (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is, the itemsets that generate at least “minutil” profit. For example, assume that “minutil” is set to 25$ by the user. The result of a high-utility itemset mining algorithm would be the following.

[Figure: The high-utility itemsets]

For example, consider the itemset {b,d}. It is considered to be a high-utility itemset because it has a utility of 40$ (it generates a profit of 40$), which is no less than the minutil threshold that has been set to 25$ by the user. Now, let’s look in more detail at how the utility (profit) of an itemset is calculated. In general, the utility of an itemset in a transaction is the sum, over the items of the itemset, of the purchase quantity of each item multiplied by its unit profit. For example, consider the figure below. The profit of {a,e} in transaction T0 is 1 x 5 + 1 x 3 = 8$. Similarly, the profit of {a,e} in transaction T3 is 2 x 5 + 2 x 3 = 16$. The utility of an itemset in the whole database is then the sum of its utility in all transactions where it appears. Thus, for {a,e}, the utility is 8$ + 16$ = 24$ because it appears only in transactions T0 and T3.

[Figure: Illustration of how to calculate the utility of itemset {a,e}]
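
To make this calculation concrete, here is a small Java sketch that computes the utility of an itemset from purchase quantities and unit profits. It is my own simplification written for illustration (not SPMF code); the quantities and unit profits are those from the example, and only the two transactions containing {a,e} are included, since the others contribute nothing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UtilityExample {

    // Utility of an itemset in one transaction: sum of (quantity x unit profit) of its items,
    // or 0 if the transaction does not contain every item of the itemset.
    static int utilityInTransaction(Map<String, Integer> transaction, Set<String> itemset,
                                    Map<String, Integer> unitProfit) {
        if (!transaction.keySet().containsAll(itemset)) {
            return 0;
        }
        int utility = 0;
        for (String item : itemset) {
            utility += transaction.get(item) * unitProfit.get(item);
        }
        return utility;
    }

    // Utility of an itemset in the database: sum of its utility in all transactions where it appears.
    static int utility(List<Map<String, Integer>> database, Set<String> itemset,
                       Map<String, Integer> unitProfit) {
        int total = 0;
        for (Map<String, Integer> transaction : database) {
            total += utilityInTransaction(transaction, itemset, unitProfit);
        }
        return total;
    }

    public static void main(String[] args) {
        // Unit profits from the example: a = 5$, b = 2$, c = 1$, d = 2$, e = 3$
        Map<String, Integer> unitProfit = new HashMap<>();
        unitProfit.put("a", 5);
        unitProfit.put("b", 2);
        unitProfit.put("c", 1);
        unitProfit.put("d", 2);
        unitProfit.put("e", 3);

        // Quantities of "a" and "e" in T0 and T3 (the other items of T0 are omitted,
        // since they do not change the utility of {a,e})
        Map<String, Integer> t0 = new HashMap<>();
        t0.put("a", 1);
        t0.put("e", 1);
        Map<String, Integer> t3 = new HashMap<>();
        t3.put("a", 2);
        t3.put("c", 6);
        t3.put("e", 2);

        Set<String> ae = new HashSet<>(Arrays.asList("a", "e"));
        // 1 x 5 + 1 x 3 = 8 in T0, plus 2 x 5 + 2 x 3 = 16 in T3, so 24 in total
        System.out.println("utility({a,e}) = " + utility(Arrays.asList(t0, t3), ae, unitProfit));
    }
}
```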

Why is the problem of high-utility itemset mining interesting?

The problem of high-utility itemset mining is quite interesting for the following reasons.

First, it may be more interesting from a practical perspective to discover itemsets that generate a high profit in customer transactions than those that are bought frequently.

Second, from a research perspective, the problem of high-utility itemset mining is more challenging. In frequent itemset mining, there is a well-known property of the frequency (support) of itemsets which states that, given an itemset, all its supersets must have a support that is lower or equal. This is often called the “Apriori property” or “anti-monotonicity” property, and it is very powerful for pruning the search space, because if an itemset is infrequent then we know that all its supersets are also infrequent and may be pruned. In high-utility itemset mining there is no such property. Thus, given an itemset, the utility of its supersets may be higher, lower or the same. For example, in the previous example, the utilities of the itemsets {a}, {a,e} and {a,b,c} are respectively 20$, 24$ and 16$.
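
To state the contrast compactly (here sup denotes the support and u the utility, using the values from the example above):

```latex
% Anti-monotonicity of the support (Apriori property):
X \subseteq Y \implies \mathrm{sup}(Y) \le \mathrm{sup}(X)
% No such guarantee holds for the utility, e.g. from the example above:
u(\{a\}) = 20, \quad u(\{a,e\}) = 24, \quad u(\{a,b,c\}) = 16
```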

In this blog post, I will not go into the details of how high-utility itemset mining algorithms work. But a key idea is to use upper-bounds on the utility of itemsets that restore the anti-monotonicity property, to be able to prune the search space. This will be the topic of a future blog post.

Open-source implementations and datasets

Several algorithms for high-utility itemset mining have been proposed over the years. Java implementations of the state-of-the-art algorithms are currently offered in my open-source data mining library named SPMF. It offers, for example, the source code of Two-Phase (2005), UPGrowth (2011), HUI-Miner (2012) and FHM (2014). To our knowledge, FHM is one of the fastest algorithms for this problem. It was shown to be up to six times faster than HUI-Miner, which was shown to be up to 100 times faster than UPGrowth, which was shown to be up to 1000 times faster than Two-Phase. You can try FHM and the other algorithms mentioned above by going to the SPMF website. On the website, you will find instructions on how to run the algorithms, and some datasets on the dataset page. Update: recently, the EFIM algorithm (2015) was proposed and shown to outperform FHM by up to 1000 times; it is also offered in SPMF.
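
If you would rather call an algorithm from your own Java code than use the SPMF graphical interface, the sketch below shows roughly how FHM can be invoked. The package path, class name, runAlgorithm(input, output, minutil) signature, and file names are assumptions based on the general pattern of the SPMF developer examples; please check the documentation of the version you download for the exact API.

```java
// A minimal sketch of calling the FHM implementation bundled with SPMF.
// Assumption: the package path, class name and method signature below may differ
// slightly between SPMF releases; verify them against the SPMF developer examples.
import ca.pfv.spmf.algorithms.frequentpatterns.fhm.AlgoFHM;

public class RunFHMExample {
    public static void main(String[] args) throws Exception {
        String input = "DB_utility.txt"; // transaction database in the SPMF utility format
        String output = "output.txt";    // file where the high-utility itemsets will be written
        int minutil = 30;                // minimum utility threshold

        AlgoFHM algo = new AlgoFHM();
        algo.runAlgorithm(input, output, minutil); // mine the high-utility itemsets
        algo.printStats();                         // print runtime, memory usage and pattern count
    }
}
```

The algorithms can also be launched from the SPMF command-line interface with a command along the lines of java -jar spmf.jar run FHM DB_utility.txt output.txt 30; again, the exact syntax is documented on the SPMF website.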


Note that there also exists several variations of the problem of high-utility itemset mining, which I will not cover in this blog post.

Conclusion

In this blog post, I gave an overview of the problem definition of high-utility itemset mining. I will try to keep posting blog posts such as this one about interesting data mining problems in the future. I hope that you enjoyed reading 😉

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

 


97 Responses to An Introduction to High-Utility Itemset Mining

  1. Dang Nguyen says:

    Dear Prof. Philippe,

    Thank you for your effort and time in writing this post, which provides a clear and informative introduction to HUIM. It is really useful for people who are starting to do research on this problem.

    Recently, many variations of frequent itemsets have been proposed, such as high-utility itemsets, erasable itemsets, top-rank-k frequent itemsets, and so on. Do you think frequent itemset mining and its extensions are still a hot and common topic with a high potential for publication in good journals/conferences?

    Best,
    Dang Nguyen

    • Hi Dang,

      Glad you enjoyed this blog post. I’m thinking about writing a Part 2 on utility mining in the future.

      I think that it depends. To publish in a good journal/conference, novel ideas need to be shown. A work that is “incremental” (that just combines two ideas in a straightforward way) will usually not be accepted in good conferences and journals. For example, I personally don’t like the problem of erasable itemsets because it is not a challenging problem. I had implemented one algorithm for SPMF and it is basically the same thing as Apriori, without any challenge. I did not read the recent papers on that topic, but at least the one that I read was straightforward. In my opinion, high utility mining is probably the best topic among those. It offers some new challenges because we need to use upper-bounds to solve the problem. Of course, it is still possible to do some incremental work on this topic that is not important. But there is also potential to create something new.

      Another possibility is to create some new problems related to pattern mining and show that they are useful and have some real applications (that would be better).

      So I would say, it depends more on what you can do with the topic than on the topic itself, and some topics are more challenging. Going for a more challenging topic is always better, because it raises challenges and at the same time forces you to think about new solutions.

      Best,

      Philippe

  2. Dang Nguyen says:

    Dear Prof. Philippe,

    Thank you for sharing your opinion. I totally agree that publishing papers in high quality journals/conferences often requires novel ideas and significant contributions. That is also one of my goals during my PhD tenure ahead 🙂

    Recently, I have been working on the problem of frequent subgraph mining. It would be great if you posted an overview of this topic.

    Looking forward to reading the part 2 and hope to see you one day 🙂

    Best,
    Dang Nguyen

    • Hi Dang,

      You are welcome. Thanks for reading and commenting on the blog 😉

      Subgraph mining looks like a good topic. Graphs are a more complicated structure than transactions, so there should be more challenges than in something simpler like itemset mining. Also, there is less work on subgraph mining, in my opinion, than on itemset mining, so many problems have not been solved for graphs.

      I’m not an expert on graph mining. I have only read a few papers on this topic.

      If you are in Vietnam and want to meet me, I will be at PAKDD 2015 in Ho Chi Minh City for a week in May. Are you in that city? or will you attend this conference?

      Best,

      Philippe

      • Prajakta N. Kharwandikar. says:

        Hello sir,
        I’m a little bit confused because I want to do my M.Tech final-year project in data mining. As you said, is high-utility itemset mining better to go with, or is subgraph mining good? What is the complexity of each topic, and which topic would be welcomed by the experts?

    • Prajakta N. Kharwandikar. says:

      Sir, can I get some input from you regarding subgraph mining?

  3. prakash says:

    I searched for a title, and your post helped me. Thanks…

  4. lalitha says:

    Dear Sir,
    Can you give me some challenging problems in high utility itemset mining / high utility sequential pattern mining? Actually, I was inspired by your research in high utility itemset mining, so I have chosen HUI as my research topic, but now I am unable to define a problem statement in it. So please, Sir, can you help me to identify a challenging problem in HUI?

    • Hi, There are a lot of possible problems on this topic. You can actually combine high utility itemset mining with any other topic from pattern mining to create a new topic. For example, authors have previously combined high utility itemset mining with top-k pattern mining to do top-k high utility pattern mining. This is one example. But you can combine any topic with high utility pattern mining to create a new one.

      Best,

  5. lalitha says:

    Can you give more details about the upper bounds used in high utility itemset mining?

    • It would be a little bit long to explain. I would need to write another blog post on this topic, but I don’t have time now. Maybe I’ll do it later.

      The main idea is that, unlike the support, the utility is not anti-monotonic. Thus, the supersets of an itemset may have a utility that is the same, lower or higher. For this reason, the utility cannot be directly used to prune the search space. To avoid this problem, high utility itemset mining algorithms introduce some upper-bounds. For example, if you read the Two-Phase paper, you will see that they are using the TWU upper-bound. The TWU is a measure that is always equal to or greater than the utility. This is why it is called an upper-bound. Why is it interesting? Because the TWU is anti-monotonic. Thus, it can be used to prune the search space. Besides the TWU, better upper-bounds have been introduced. For example, in HUI-Miner and FHM, the “remaining utility” is used to calculate an upper-bound.

      Best,

      • lalitha says:

        Thank you so much for your reply, sir; it helps me a lot. Sir, I need some more information about HUI, like combinations of HUI. Can you give details about it? I need a specific problem related to it. Please help me, sir.

        • To find topics that are combinations, you would need to search for some papers on pattern mining such as itemset mining and see what has not been done in high utility mining. I cannot really do that for you because it takes time, I’m busy, and actually I can only do that for my own students.

  6. c.sivamathi says:

    Dear Sir,

    Thank you for such an informative blog. I am doing my research in utility mining. Your blog and your SPMF library are really helpful. Thanks for your wonderful contribution.
    I also request you to post more blogs related to negative utility, on-shelf utility, incremental utility, etc.
    Sir, can you list any other tools to calculate high utility itemsets?

    Thanks.

    • Hi, Thanks for reading the blog and commenting. Glad you like it and that you are using SPMF. OK, I will try to write more about high utility pattern mining in the future. As you have seen, I have worked quite a lot on this topic in recent years, so yes, I could write more on it. Right now I’m a little bit busy, but in the next few weeks I might have time to put more content on this topic. I will remember that request.

      Best regards

  7. Amol j Gosavi says:

    Dear authors, is anyone able to give me a suggestion for finding the complexity of the algorithms used in utility mining, i.e. UP-Growth, UP-Growth+, etc.?

  8. Amol j Gosavi says:

    Dear Sir,
    How do we classify utility mining algorithms like UP-Growth, UP-Growth+, etc., as P, NP, NP-complete or NP-hard?

    • If there is no complexity analysis provided in the paper, you may do your own analysis of the complexity of these algorithms. Personally, I don’t have time to do it. Some algorithms like UPGrowth are based on FPGrowth, so you could check the complexity of FPGrowth, which has likely been analyzed.
      Two-Phase is based on Apriori. You may find a discussion of the complexity of Apriori here:

      M. Hegland, “The Apriori Algorithm – A Tutorial”, Mathematics and Computation in Imaging Science and Information Processing, vol. 11, pp. 209-262, World Scientific Publishing Co., 2007.

      The EFIM algorithm, which is the fastest high-utility itemset mining algorithm, does most of its operations in linear time for each pattern in the search space. Thus, its complexity is roughly linear in the number of patterns in the search space.

  9. Pingback: Brief report about the 12th International Conference on Machine Learning and Data Mining conference (MLDM 2016) - The Data Mining & Research Blog

  10. Pingback: Brief report about the 16th Industrial Conference on Data mining 2016 (ICDM 2016) - The Data Mining & Research Blog

  11. sharda says:

    Respected Sir,

    It is a really nice post.
    Thanks for explaining a very broad topic in few words.
    Thank you so much, sir.
    Sir, now I am going to start a project for my M.Tech; can you please guide me on which topic I should prefer?
    I searched many IEEE papers regarding high utility mining.
    Sir, can you please suggest one?

    Thanks and Regards

    • Hi,
      Glad you like this blog post and that it has been useful.

      I think that you could start from the source code of the high utility mining algorithms in SPMF. You will then save a lot of time because many algorithms are already implemented, and you also have the datasets for doing experiments on the website. Some topic ideas could be to modify an algorithm to add some novel ideas. For example, here are some possibilities:
      – adapting FHM or EFIM to discover rare high utility itemsets by combining ideas from rare itemset mining with high utility mining
      – adapting HUSRM for mining high utility sequential rules with a window constraint by combining HUSRM with ideas similar to TRuleGrowth
      – adapting the EFIM algorithm for mining periodic high utility itemsets
      – adapting HUSRM for top-k high utility sequential rule mining
      ….
      There are a lot of possibilities. I have just listed a few examples.

      Philippe

      • sharda khode says:

        Respected Sir,

        Sorry, I am disturbing you.

        But I need your help regarding the explanation of TKU and TKO from your best paper, “Efficient Algorithms for Mining Top-K High Utility Itemsets”.

        I didn’t get the actual flow of the TKU and TKO algorithms.
        Can you please help me with this?

        Thanks and Regards
        Sharda

        • Hello,

          First, to give you some advice, I think that it is not really necessary to understand TKU, which is a quite complex algorithm, since faster and simpler algorithms have been proposed. TKO is faster than TKU. And last year, I participated in another paper where we proposed an algorithm that is faster than TKO:

          Duong, Q.-H., Liao, B., Fournier-Viger, P., Dam, T.-L. (2016). An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies. Knowledge-Based Systems (KBS), Elsevier, 104:106-122.

          So you can perhaps read that paper first. Maybe it is easier to read than TKO, and it is certainly easier to read than the TKU algorithm, which is very complex.

          Also, that new algorithm from 2016 and TKO are both inspired by the HUI-Miner algorithm. So you may perhaps want to also read the HUI-Miner paper. This should help you understand the other algorithms that extend HUI-Miner.

          Best,
          Philippe

  12. divvela srinivas rao says:

    What are the applications of high utility itemset mining, and how is it used in those applications?

    • Hi, The most important application of high utility itemset mining is to analyze customer transaction data. For example, if you have a database of customer transactions, where each transaction indicates the items bought in a store by a customer, you can find the itemsets that generate the most money using high utility itemset mining. This is interesting from a marketing perspective. For example, if you know that the itemset {tomato, cheese, pita} generates a lot of money, the retail store manager may consider co-promoting these products or offering some discount to try to sell even more of them. So basically, it can be used to analyze customer behavior by finding the most profitable itemsets. Then, this information can be used to make decisions, such as decisions related to marketing.

      High utility itemset mining could also be used in other domains. For example, let’s say that you have a factory where each part of some product has a cost. Then assume that some machines assemble the parts to build products that are sold. You could use high utility itemset mining to find the sets of parts that are the most costly. Then, knowing these, you could try to replace these sets of parts by a single part that is cheaper to produce, to reduce the cost of building the products. This is just an example that I thought about right now. If you use your imagination, you can transpose high utility mining to other problems. For example, in some papers about high utility itemset mining, people often mention that it has also been applied in bioinformatics and for analyzing web page click streams. I did not read these papers though. But you could find them.

      Hope this answers your question.

      • divvela srinivas rao says:

        Thank you for your answers. I understood that concept, sir. High utility itemset mining is used in bioinformatics and web page click streams; I searched for some papers but did not find any description of how it is used. Could you kindly send me the related papers, if there are any?

        • Some references are: “mobile commerce (Tseng et al., 2013), click stream analysis (Ahmed et al., 2010; Thilagu et al., 2012), biomedicine (Liu et al., 2013) and cross-marketing (Ahmed et al., 2009; Tseng et al., 2013).”

          Tseng VS, Shie BE, Wu CW, Yu PS (2013) Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Transactions on knowledge and data engineering, 25(8):1772–1786

          Ahmed CF, Tanbeer SK, Jeong B (2010) Mining High Utility Web Access Sequences in Dynamic Web Log Data. Proceedings of the international conference on software engineering artificial intelligence networking and parallel/distributed computing, IEEE, London, UK, June 2010, pp. 76–81

          Thilagu M, Nadarajan R (2012) Efficiently Mining of Effective Web Traversal Patterns With Average Utility. Proceedings of the international conference on communication, computing, and security, CRC Press, Gurgaon, India, September 2016, pp. 444–451

          Liu Y, Cheng C, Tseng VS (2013) Mining Differential Top-k Co-expression Patterns from Time Course Comparative Gene Expression Datasets. BMC Bioinformatics, 14(230)

          Ahmed CF, Tanbeer SK, Jeong BS, Lee, YK (2009) Efficient tree structures for high-utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering 21(12):1708–1721

          I did not read these papers recently, so you may have a look by yourself to see the content.

  13. divvela srinivas rao says:

    Sir, I have one doubt: in utility mining, does utility mean importance or profit, and to whom, the user or the vendor? I believe it means profit to the vendor; is that correct or wrong, sir? Please give me some example, sir.

    • Hello, Normally, it is understood as profitable to the vendor. A high utility itemset is an itemset that generates a lot of profit for the vendor (the retail store).

      But high utility itemset mining is a quite general problem. Basically, you have a transaction database that contains transactions with quantities, and where items have unit profit values. You can always transpose this problem to other domains. As I answered in your other message, it can also be applied to other domains. So, I think it would also be possible to apply it for finding patterns in terms of money saved by the customer, rather than the profit obtained by the vendor. In that case, instead of having a profit value for each item, we would store the amount of money saved by a customer for each item bought. The database would hence be different.

  14. divvela srinivas rao says:

    What is utility itemset mining with negative item values? Give some examples, sir. Where is it used?

    • In high utility itemset mining, it is assumed that all items have positive utility values. If we consider applying high utility itemset mining to analyze a customer transaction database, this means that all items must always generate a positive profit. So, for example, if a retail store sells some apples, tomatoes, some milk, etc., or any other items, they must always generate a positive profit.

      But this assumption is not realistic in real-life retail stores. For example, many big stores such as Wal-Mart and others will sell products at a loss. This means that they will maybe sell you some milk and lose money (the profit will be negative). The reason is that by selling products at a loss, they try to make you come to their store and hope that you will buy other, more expensive products. If you apply traditional high-utility itemset mining algorithms on a database that contains negative unit profits, it was shown that these algorithms can generate an incorrect result. Thus, some algorithms such as FHN have been designed to discover high utility itemsets in databases that contain positive and negative profit values.

      The link for downloading the conference paper about FHN is here:
      http://www.philippe-fournier-viger.com/FHN.pdf

      Besides, you can find the source code here:
      http://www.philippe-fournier-viger.com/spmf/

      And if you want more information, send me an e-mail at philfv8 AT yahoo.com and I will send you a copy of the longer journal paper about FHN that will be published in the Knowledge-Based Systems journal (about 30 pages).

  15. divvela srinivas rao says:

    Hi sir,
    It is somewhat difficult for me to understand FHN. Please provide me with some more information to understand FHN (high utility itemset mining with negative item values) with examples. I sent a mail to you, sir.

  16. divvela srinivas rao says:
  17. divvela srinivas rao says:

    What is the difference between real, synthetic, sparse and dense datasets?

    • A real dataset is some data that comes from real life. For example, it can be some data from a retail store.

      A synthetic dataset is some fake data. The dataset has been generated just for the purpose of comparing algorithms. Using synthetic datasets is useful for comparing algorithms because you can, for example, increase the size of the dataset to see how they perform when the size increases or when other characteristics are changed.

      A sparse dataset is a dataset where entries (records) are not very similar to each other. For example, let’s say that you have some data from a retail store and few customers buy the same things.

      A dense dataset is a dataset where entries (records) are highly similar. For example, let’s say that you have data from a retail store and everybody buys almost the same things all the time.

      Typically, when comparing the performance of algorithms (especially for itemset mining and other pattern mining problems), researchers will use both dense and sparse datasets.

  18. divvela srinivas rao says:

    Thank you for your reply sir. It is very useful for me for comparing algorithms.

  19. divvela srinivas rao says:

    How can we obtain datasets for high utility itemset mining with or without negative values? In SPMF, I am not getting a full description of the datasets with attribute values; where can we get datasets with a full description, such as records and attributes?

    • If you want real shopping data with attribute values, you could look at the Foodmart and Chainstore datasets. Foodmart comes from a sample database provided with Microsoft SQL Server 2000, while Chainstore comes from the NU-MineBench software. On the SPMF website they do not contain attribute values because we have converted these databases to text files. But you could download the original databases and check the attribute values.

      Note that these two datasets have utility values, but may not have negative values, though.

      As for how to obtain Microsoft SQL Server 2000 and NU-MineBench, you can look on Google. I don’t have any information about how to use them or how they were converted to text files.

  20. divvela srinivas rao says:

    What is the difference between on-shelf utility mining and periodic utility mining? Are they the same or not? In both, we are considering time periods.

    • No, they are different. If you read the papers, you will find the problem definition of each of them.

      Periodic: try to find something that appears periodically, e.g. a customer buys bread every week.

      On-shelf: the goal is different. The idea is to find what makes the most money, as in high utility itemset mining, but it considers the fact that some items are, for example, only sold during the summer. Thus, it considers the shelf time of items when calculating the utility. It considers time, but the itemsets found are not necessarily periodic. That is not the goal.

  21. Sivamathi says:

    Dear Sir,

    First of all, I thank you for your informative blog. I am doing my research in utility mining.
    I would like to know about post-utility-mining methods, i.e. what to do after retrieving the high utility itemsets. This would help me in my research work. Thanks in advance.

    • Hello,

      Glad you like the blog 😉 I don’t know of much work on post-utility mining, so I think that you have chosen a good topic. It could be about how to visualize the itemsets. If you search a little bit about itemset mining in general, association rules and other kinds of patterns, you will see that there is some work on how to visualize the patterns found. This is something that you could do for high utility itemsets. Besides that, it is certainly possible to use the itemsets found in other ways for some specific application domains. For example, once the high utility itemsets are found, they could be used for classification, recommendation, etc. But I did not see much work on that. So if you work on this, I think you have many possibilities.

      Regards,

      • Sivamathi says:

        Thank you for your suggestion, sir. I planned to choose classification based on association rules. Could you please suggest: what is the purpose of classification based on association rules, and why do we need to classify the rules?

        • Classification is when you have a set of instances and you want to determine to which class they belong.

          For example, if you have data about some bank customers, you may want to predict if some new customers will belong to the class “good customer” or “bad customer that go bankrupt”.

          In classification by association rules, you do not classify the rules. You classify the instances.

          For example, if you have a database about bank customers where you have several attributes such as age, gender, income / year, etc., you could extract some association rules from this database that are specific to the class “bad customer” or “good customer”

          Then you could use these rules to classify some new instances (customers) as either good or bad customers.

          This is the main idea of classification in data mining. We do not classify the rules; we use the rules to perform classification. There are a lot of other classification models, by the way, such as decision trees, neural networks, etc.

  22. Reza says:

    Dear sir,
    I am not good at programming but am highly interested in utility mining. Could you please suggest some areas that don’t need implementation?

    • Hello,

      If you do not want to do programming, then the only thing that you could do is apply the existing algorithms on some data to get results. You can try different algorithms and see what kind of results they generate for your data.

      This can be interesting from an application perspective.

      However, if you are looking for a project for a master degree in computer science, I would not consider this as a good project because a master degree in computer science should involve programming.

      Best,
      Philippe

  23. Brookm says:

    Hello,

    Thanks for this blog post.
    Just wondering about the scalability of the FHM algorithm: is it fine to run it on a dataset of 100 million transactions with reasonable memory and computation time?

    Thanks

    • Hello,

      It would be better to use the EFIM algorithm, which is faster than FHM and uses less memory.

      I am not sure that it would support 100 million transactions. But you could try with a subset. For example, you might use a week’s or a month’s worth of transactions instead of the full dataset, or you could use sampling if needed to reduce the size of the dataset.

      Besides, the performance of the algorithm actually depends on the “minimum utility” parameter. The lower the parameter is set, the more difficult the task becomes. For this type of problem, if you had 100 VERY long transactions and set the parameter very low, it may never be able to terminate because of the extremely large search space, but it may also run on millions of transactions without problems if the parameter is higher. So, what I mean is that it depends on the parameter. The search space can increase exponentially when the parameter is lowered.

      In the experiments described in the research paper, we ran these algorithms on about 1 million transactions from a retail store. So I am confident that it could run on at least 1 million transactions.

      By the way, if you want to use this in a live environment, there are many ways that the algorithm could be tweaked to improve performance depending on your needs. For example, if you add some additional constraints such as a length constraint, the algorithm will become much faster. Besides, the algorithm could rather easily be parallelized if needed.

      Best regards,

  24. sivamathi says:

    Dear sir,
    Could you please suggest applications of utility mining in banking industry. Which property can be taken as utility measure? Thanks in advance.

  25. LALITHA J says:

    Thank you so much sir..
    your way of explanation is too good

  26. honey says:

    Can anybody help me with how to use preprocessing techniques in utility mining? I’m new to this.

  27. LALITHA J says:

    Sir,
    Can you explain Transaction Weighted Utilization (TWU)? I also need its calculation.

    • The TWU calculation is simple. For an itemset X, the TWU of X is the sum of the profit of all transactions containing X.

      For example, if you consider the itemset {b,c,d}. This itemset appears in the transactions T0 and T1 in my example. Hence, the TWU of {b,c,d} will be the sum of the profit of all items in transactions T0 and T1.

      Thus, TWU({b,c,d}) = profit of all items in T0 + profit of all items in T1
      = (1 x 5 + 5 x 2 + 1 x 1 + 3 x 2 + 1 x 3) + (4 x 2 + 3 x 1 + 3 x 2 + 1 x 3) = 25 + 20 = 45 $.
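
      If it helps, here is a small Java sketch of that computation (just an illustration written for this reply, not SPMF code; the quantities and unit profits are the ones from the running example of the post):

      ```java
      import java.util.Arrays;
      import java.util.List;
      import java.util.Map;
      import java.util.Set;

      public class TWUExample {

          // Transaction utility (TU): total profit of all items in the transaction
          static int transactionUtility(Map<String, Integer> transaction, Map<String, Integer> unitProfit) {
              int tu = 0;
              for (Map.Entry<String, Integer> entry : transaction.entrySet()) {
                  tu += entry.getValue() * unitProfit.get(entry.getKey());
              }
              return tu;
          }

          // TWU of an itemset X: sum of the TU of every transaction that contains X
          static int twu(List<Map<String, Integer>> database, Set<String> itemset,
                         Map<String, Integer> unitProfit) {
              int total = 0;
              for (Map<String, Integer> transaction : database) {
                  if (transaction.keySet().containsAll(itemset)) {
                      total += transactionUtility(transaction, unitProfit);
                  }
              }
              return total;
          }

          public static void main(String[] args) {
              // Unit profits: a = 5$, b = 2$, c = 1$, d = 2$, e = 3$
              Map<String, Integer> unitProfit = Map.of("a", 5, "b", 2, "c", 1, "d", 2, "e", 3);

              // T0 and T1 from the example (the only transactions containing {b,c,d})
              Map<String, Integer> t0 = Map.of("a", 1, "b", 5, "c", 1, "d", 3, "e", 1);
              Map<String, Integer> t1 = Map.of("b", 4, "c", 3, "d", 3, "e", 1);

              Set<String> bcd = Set.of("b", "c", "d");
              // TU(T0) = 25 and TU(T1) = 20, so TWU({b,c,d}) = 45
              System.out.println("TWU({b,c,d}) = " + twu(Arrays.asList(t0, t1), bcd, unitProfit));
          }
      }
      ```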

  28. LALITHA J says:

    thank you sir

  29. TR RAO says:

    Dear Sir,
    Can I get code for TFP : Mining frequent closed itemsets using FP growth. ?

  30. siva says:

    Dear sir,

    I have one doubt about FHM. In the FHM paper, you have mentioned that the pruning strategies of the algorithm reduce candidates by 18% to 91%, 87%, 87% and 31% to 95% for the Chainstore, BMS, Kosarak and retail datasets, respectively. Could you please say how you calculated this?

    Thank you sir.

    • Hello,

      It is simple to calculate this.

      Consider one of the datasets.
      Let’s say that for the lowest minutil value, HUI-Miner has 1000 candidates and FHM has 250 candidates.
      Let’s say that for the highest minutil value, HUI-Miner has 100 candidates and FHM has 90 candidates.

      By a simple calculation, you can find that the reduction for the lowest minutil value is 75%, and that the reduction for the highest minutil value is 10 %.

      Thus, I would write that the number of candidates was reduced by 10% to 75%.

      That is the main idea.

      In this example, I just used arbitrary numbers. But in the experiments of the FHM paper, I collected these numbers from the execution of the algorithms.

      Hope this is clear

  31. Kuldeep Singh says:

    Can I get the code for top-k one-phase algorithms such as the TKO and kHMC algorithms?

    • Hi, I would like to share the code, but I am not the main author of these papers.
      For TKO, you would need to send an e-mail to Cheng-Wei Wu to ask for the code.
      For kHMC, you would need to send an e-mail to Huy Duong to ask for the code.

      Best regards,

      • Kuldeep Singh says:

        Thanks, sir, for your nice suggestion. Sir, could you please send me the two negative-utility datasets named T10I4D100K and BMS-POS? These datasets are used in one of your works titled
        “FHN: An efficient algorithm for mining high-utility itemsets with negative unit profits”.

  32. c.sivamathi says:

    Dear sir,
    Could you please explain the performance measures for high utility itemsets, other than execution time, memory space, number of itemsets, etc.?
    Also, please suggest how to calculate the accuracy of the itemsets retrieved.
    Thanks in advance.

    • In general, in pattern mining, the goal is to discover some patterns. You can use these patterns to understand the data, and you could also use these patterns to take business decisions, etc.

      The typical measures to evaluate a pattern mining algorithm are speed and memory usage for various parameter values and database sizes (scalability), and the number of patterns. We want to know how fast you can get the patterns, how much memory will be used for various parameters, and how the algorithm will behave for different dataset sizes.

      Accuracy is a measure typically used in classification to check how many times you have correctly made predictions. In pattern mining, the goal is not to classify or make predictions. Thus, the concept of accuracy does not exist in high utility itemset mining. But if you find a way to use the patterns to make predictions, then you could calculate the accuracy.

  33. Sudip says:

    Hello Sir
    I did my Master’s thesis 3 years ago on high utility rare itemset mining and finished with an algorithm that was parallelizable. But at that point in time I was not exposed to parallel programming environments, so I just concluded with a serial implementation. Now I have hands-on experience with parallel programming models like OpenMP, MPI and CUDA. I was wondering whether I can pursue my doctoral thesis by employing one or more of these parallel computing models for high utility itemset mining? My concern is regarding the worth of this idea. I seek your frank comments, Sir.

  34. Mahmoud says:

    This is one of the best explanations I’ve ever read. I congratulate you for the great stuff …
    Thank you
    Mahmoud Darvish
    PhD student
    Free University of Iran-Semnan

  35. Lalitha says:

    Hello Sir,
    Can you explain the FCBF feature selection algorithm to me?

  36. Kuldeep Singh says:

    Dear Sir,
    Could you please send me a synthetic dataset for high-utility pattern mining? I am unable to generate these types of datasets as suggested in many research papers.

    • You can generate them using the SPMF software. It offers the synthetic dataset generator.

      http://www.philippe-fournier-viger.com/spmf/

      After installing the software, in the documentation, you should check examples:

      Example 141 : Generating a synthetic sequence database
      Example 144 : Generating synthetic utility values for a transaction database without utility values

      To generate a new dataset, you should first do Example 141. It will give a transaction database. Then you should do Example 144 to add the utility values to the dataset.

  37. Snehal Chaudhari says:

    Sir… please give me a real-life application where the TKU and TKO algorithms are used.

    • Hello, the main application is to analyze customer transactions. With TKU/TKO, you need to set a parameter k. For example, let’s say that we set k = 100. These algorithms will find the k itemsets (sets of items) that generate the highest profit. For example, they could find that {bread, milk} is one of the most profitable sets of items purchased by customers. Having this information, a business manager can make business decisions such as promoting these items to generate more money.

      This is the main application. However, the problem of high utility itemset mining is quite general. A transaction is basically a set of symbols, and utility values are some numbers that are associated with these symbols. Thus, it is easy to apply these algorithms in many other domains. For example, if you have e-learning data, each transaction could be the courses taken by a student, and the utility values could be the scores that the student received for those courses. Then,
      you could analyze this data to find the groups of courses in which students obtain the highest scores (if that would make sense… I don’t know). But this is just an example. You could do the same with hospital data, etc. So there are many applications.

      • Snehal Chaudhari says:

        Thanks sir…for the reply…
        Can you explain the hospital example to me?

        • For hospital data, a transaction can be the medicines taken by a patient, and the utility values the cost of the medicines. I’m not sure if it makes sense. But it is just an example to show that it is easy to transpose the problem to another domain.

          Another example: it could be web log data. In that case, a transaction is the web pages viewed by some user, and the utility values are the time spent on each page.

    • Find the set of products that yield a high profit in a retail store. This is an application where TKO and TKU can be used.

  38. Snehal Chaudhari says:

    Thanks sir

  39. krishna says:

    Dear Sir,
    Could you please describe what the profit value of mobile services would be if we have mobile service access sequences instead of transactions.

    Thanks in advance.

  40. Diem says:

    Thank you for your post.

  41. Vandna dahiya says:

    Sir..
    How could we apply HUIM to big data?
    Where can we get large datasets?

    • Hello,

      I do not have datasets that are large enough to be called “big data” for HUIM. Thus, you could always use synthetic datasets by generating them with a tool such as SPMF. Another solution would be to discuss with real companies and ask them for their data, or to convert some other big datasets available on the Internet to the format required for HUIM. However, it is not easy to find such datasets. The simplest solution is to use synthetic datasets.

      By the way, when we talk about big data, we should not only think about the amount of data. Actually, in HUIM, the amount of data is not the most difficult part of the problem. The main problem in HUIM is the size of the search space. For example, if you have a database that contains only 1 transaction with 500 items and set minutil = 1, the search space will be 2^500, and the algorithm will never terminate. Thus, even with small data, HUIM is very challenging. And you could have a very large database in HUIM that is not challenging because the search space is small. So what I mean is that if you develop a cloud algorithm for HUIM, it could still be useful even on small data if you can speed up the discovery of patterns. You do not need a huge dataset to benefit from a cloud implementation of HUIM algorithms.

      Regards
