How to find cost-effective patterns in data?

Have you ever wondered how to find patterns in data that are not only frequent but also profitable and cost-effective? For example, if you are an online retailer, you may want to know what products are often bought together by customers and generate high profits but require low costs (such as shipping fees or discounts). Or if you are an educator, you may want to know what learning activities are frequently performed by students and result in high grades but require low efforts (such as time or resources).

people shopping in a supermarket

In this blog post, I will briefly introduce a new problem called low-cost high utility itemset mining (LCHUIM), which aims to find such patterns. LCHUIM is a generalization of the well-known problem of high utility itemset mining (HUIM), which focuses on finding patterns that have high utilities (benefits) according to a user-defined utility function. However, HUIM ignores the cost associated with these patterns, such as time, money or other resources that are consumed. The novel problem of LCHUIM addresses this limitation by considering both the utility and the cost of patterns.

To be more precise, a low-cost high utility itemset is an itemset (a set of values) that must satisfy three criteria:

  • Its average utility must be no less than a min_utility threshold set by the user.
  • Its average cost must be no greater than a max_cost threshold set by the user.
  • Its support (occurrence frequency) must be no less than a minsup threshold set by the user.

For example, suppose we have a transaction database that contains information about customers’ purchases and their profits and costs. A possible low-cost high utility itemset could be {bread, cheese}, which means that customers often buy bread and cheese together, and this combination generates high profits but requires low costs.

Finding low-cost high utility itemsets can reveal interesting insights for various domains and applications. For instance, online retailers could use LCHUIM to design effective marketing strategies or recommend products to customers. Educators can use LCHUIM to analyze students’ learning behaviors and provide personalized feedback or guidance.

To solve the problem of LCHUIM, an algorithm named LCIM (Low Cost Itemset Miner) was proposed. The algorithm uses a novel lower bound on the average cost called Average Cost Bound (ACB) to reduce the search space of possible patterns. The algorithm also employs several techniques such as prefix-based partitioning, depth-first search and subtree pruning to speed up the mining process.

The LCIM algorithm was published in this paper:

Nawaz, M. S., Fournier-Viger, P., Alhusaini, N., He, Y., Wu, Y. and Bhattacharya, D. (2022)LCIM: Mining Low Cost High Utility Itemsets . Proc. of the 15th Multi-disciplinary International Conference on Artificial Intelligence (MIWAI 2022), pp. 73-85, Springer LNAI  [ppt][source code and data]

The code and dataset of LCIM can be found in the SPMF open-source data mining library at: SPMF is an efficient open-source data mining library that contains more than 250 algorithms for various pattern mining tasks such as frequent itemset mining, sequential pattern mining, association rule mining and many more.

By the way, the concept of finding cost-effective patterns was also studied for the case of discovering sequential patterns with some algorithms called CorCEPB, CEPN, and CEPB (see this other blog post for more details).

I hope you enjoyed this short blog post and learned something new about low-cost high utility itemset mining. If you have any questions or comments, feel free to leave them below. Thank you for reading!

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250algorithms for pattern mining.

This entry was posted in Data Mining, Data science, Pattern Mining, spmf, Utility Mining and tagged , , , , , , , , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *