A Glossary of High Utility Pattern Mining

Today, I present a glossary of key terms used in high utility itemset mining. This glossary will be useful to researchers and practitioners working on this important subtopic of pattern mining. By the way, if you are new to this research area, you may find it useful to first read the survey paper that I wrote on this topic, or to watch my video lectures on this topic on my YouTube channel. Also, you may want to check out the list of key research papers on high utility itemset mining that I made, which can also be useful to get started on this topic or to discover research topics.

  • Closed high utility itemset: A high utility itemset that has no superset with the same utility. For example, if {bread, milk} and {bread, milk, cheese} have the same utility, then {bread, milk} is not closed, while {bread, milk, cheese} is closed if it has no such superset itself. The first algorithm for this problem is CHUD. Then, many more algorithms were developed to improve the performance.
  • Correlated high utility itemset: A set of items that not only yields a high utility but is also highly correlated. Various correlation measures can be used, such as the bond and the all-confidence.
  • Cost-effective pattern mining: A variation of the high utility itemset mining task where the goal is to find itemsets that have a high utility but a low cost.
  • Cross-level high utility itemset: This is a variant of the high utility itemset mining problem where items have categories such as “electronic products” or “food”. Then, the goal is to find itemsets that have a high utility but which may contain categories. The first algorithm for the general problem of cross-level high utility itemset mining is CLH-Miner. Another is TKC.
  • External utility: A measure of the importance of each item. It is usually given by the user or a domain expert. For example, in the context of a shopping database, the external utility may represent the profit that is generated by each item (e.g. each apple yields a $1 profit, while each orange yields a $2 profit).
  • Frequent itemset mining (FIM): A data mining task that finds all sets of items that appear in at least a minimum number of transactions in a transaction database. It is a special case of high utility itemset mining.
  • High-utility association rule mining: A data mining task that extends HUIM by finding association rules that have a high confidence and a high utility value. An association rule is an implication of the form X -> Y, where X and Y are itemsets.
  • High average-utility itemset: An itemset whose average-utility is no less than a minimum average-utility threshold set by the user. The average-utility of an itemset is its utility divided by its size (number of items).
  • High-utility and diverse itemset mining: A data mining task that extends HUIM by incorporating a notion of diversity to find patterns that cover different categories of items. The diversity of an itemset is defined as the number of distinct categories covered by its items divided by the total number of categories.
  • High-utility episode mining: A data mining task that extends HUIM by finding episodes that have a high utility value. An episode is a partially ordered set of events that frequently occur together in an event sequence. An event sequence is a sequence of events, where each event has a type and a timestamp. Several algorithms have been proposed for this task such as UP-SPAN and HUE-SPAN.
  • High-utility graph pattern mining: A data mining task that extends HUIM by finding graph patterns that have a high utility value. A graph pattern is a subgraph that frequently appears in a graph database. A graph database is a database where each transaction is a graph, which is a set of nodes and edges. A representative algorithm for this problem is UGMINE.
  • High-utility itemset embedding: Some algorithms like HUI2Vec have been designed to use high utility itemsets to create embeddings of transaction databases. This is a recent topic.
  • High utility quantitative itemset mining: A data mining task that extends HUIM to find high utility itemsets with additional information about the quantities that are usually purchased for each itemset. This is a harder problem. There exist various algorithms for this problem such as FHUQI-Miner, HUQI-Miner and VHUQI.
  • High utility itemset: An itemset whose utility is no less than a minimum utility threshold set by the user. For example, if the minimum utility threshold is 10, then {bread, milk} is a high utility itemset if its utility is 10 or more.
  • High utility itemset mining (HUIM): A data mining task where the goal is to find sets of values (itemsets) in a transaction database that have a high importance, where the importance is measured by a utility function. The typical application of HUIM is to find the sets of products that are purchased together and yield a high profit in transactions made by customers, but many other applications exist. There exist many algorithms such as HAMM, REX, UP-Growth, EFIM, HUI-Miner, FHM, ULB-Miner, Two-Phase and others.
  • High-utility rare itemset mining: A data mining task that combines HUIM and rare itemset mining. It aims to find itemsets that have a high utility value but a low support value in a transaction database. It can be useful for finding profitable but infrequent patterns.
  • High-utility sequential pattern mining: A data mining task that extends HUIM by finding sequential patterns that have a high utility value. A sequential pattern is an ordered list of itemsets that frequently appear in a sequence database. A sequence database is a database where each transaction is associated with a timestamp or an order. A survey of high utility sequential pattern mining is here.
  • High utility sequential rule mining: A problem where the goal is to find temporal rules that have a high utility and indicate that if someone buys something, they will buy something else afterward. The first algorithm for mining high utility sequential rules is HUSRM.
  • Internal utility: A measure of the quantity of an item in a transaction (record). For instance, when analyzing shopping data, the internal utility in a customer transaction is the quantity of each item. Thus, if a transaction indicates that a customer has bought 4 apples and 3 units of bread, the internal utilities of apple and bread in that transaction are 4 and 3, respectively.
  • Itemset: A set of items. For example, {bread, milk} is an itemset containing two items.
  • Local high utility itemset: A high utility itemset that is found in a database where transactions have timestamps. More precisely, a local high utility itemset is an itemset that has a high utility in a time interval but not necessarily in the whole database. For example, some products may only yield a high profit during the summer in a store. LHUI-Miner is the first algorithm for this problem. A variation is peak high utility itemset mining, described in the same paper.
  • Maximal high utility itemset: A high utility itemset that has no superset that is also a high utility itemset. For example, if {bread, milk} is a high utility itemset and none of its supersets is, then {bread, milk} is maximal. The first algorithm for this problem is CHUI-Miner(max).
  • Minimal high utility itemset: A high utility itemset that has no proper subset that is also a high utility itemset. The first algorithm for this problem is MinFHM.
  • One-phase HUIM algorithm: A HUIM algorithm that can mine high utility itemsets without generating candidates. Several algorithms of this type such as HUI-Miner and FHM rely on the utility-list structure to efficiently calculate the utility of itemsets and prune the search space.
  • Pattern-growth algorithm: An algorithm that uses a divide-and-conquer strategy to find high utility itemsets. It recursively constructs conditional databases to explore the search space by performing a database projection operation. Some algorithms of this type are IHUP, UP-Growth, EFIM, HAMM and REX.
  • Periodic high utility itemset: An itemset that periodically appears over time and has a high utility. For example, a customer may buy cheese and bread every week. The first algorithm for this problem is PHM.
  • Pruning strategy: A technique that reduces the search space by removing unpromising items or itemsets from consideration. The goal of using pruning strategies in an algorithm is to improve performance (solve the problem more quickly or using less memory).
  • Quantitative database: A transaction database where each item has a quantity and an external utility value. The quantity of an item represents how many units of that item are present in a transaction. The external utility of an item represents the amount of profit yielded by the sale of one unit of that item.
  • Stable high utility itemset: An itemset that has a utility that remains more or less stable over time. See the paper by Acquah Hackman for more details.
  • Support: The number of transactions that contain an itemset. For example, the support of {bread, milk} is 3 if it appears in three transactions.
  • Top-K high utility itemset mining: A variant of HUIM that finds the K most profitable itemsets in a database without requiring the user to set a minimum utility threshold. There are many algorithms for this problem. The first one is TKU. Then, more efficient algorithms were proposed such as TKO.
  • Transaction: A set of items bought by a customer. More generally, it can also be viewed as a database record containing some values.
  • Transaction database: A database containing a set of transactions made by customers. 
  • Transaction utility (TU): The sum of the utilities of all items in a transaction. For example, the TU of a transaction where bread and milk are bought with quantities 5 and 2 respectively and profit units 0.5 and 1 respectively is (5 x 0.5) + (2 x 1) = 4.5.
  • Transaction-weighted utilization (TWU): An upper bound on the utility that can be used to reduce the search space for high utility itemset mining. There exist several upper bounds. The TWU is not the tightest upper bound but it is the most popular and has been used by many algorithms.
  • Two-phase HUIM algorithm: A HUIM algorithm that consists of two phases. In the first phase, it generates candidates by applying some pruning strategies based on some upper bounds on the utility. In the second phase, it scans the database again to calculate the exact utility of each candidate and filter out those that are not high utility. Some popular two-phase algorithms are Two-Phase, IHUP and UP-Growth.
  • Utility: A measure of importance or profit of an item or an itemset. It is typically calculated by multiplying the quantity and the external utility of an item in a transaction, but other utility functions could also be used (see the small computational sketch right after this glossary).
  • Utility-list: A data structure that stores the utility information of an itemset in a compact way. It consists of a list of tuples, each containing the transaction identifier, the utility of the itemset in that transaction, and the remaining utility of the items after that itemset in that transaction. It is used by many algorithms such as HUI-Miner, FHM, ULB-Miner and many others.
  • Utility threshold: A user-defined parameter that specifies the minimum utility value for an itemset to be considered as high utility. It can be either absolute or relative. An absolute utility threshold is a fixed value, while a relative utility threshold is a percentage of the total utility of the database.
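
To make some of these definitions concrete, here is a minimal Python sketch that computes the utility of an itemset, the transaction utility (TU) and the TWU on a tiny quantitative database. All items, quantities and profit values are made up for illustration, and the functions are simple brute-force versions, not the optimized structures used by actual HUIM algorithms:

```python
# A tiny quantitative database: each transaction maps an item to its quantity
transactions = [
    {"bread": 5, "milk": 2},               # TU = 5*0.5 + 2*1.0 = 4.5
    {"bread": 1, "milk": 1, "cheese": 2},  # TU = 0.5 + 1.0 + 4.0 = 5.5
    {"cheese": 3},                         # TU = 6.0
]
# External utility: profit generated by one unit of each item
external_utility = {"bread": 0.5, "milk": 1.0, "cheese": 2.0}

def utility_in_transaction(itemset, transaction):
    """Utility of an itemset in one transaction (0 if the itemset is absent)."""
    if not all(item in transaction for item in itemset):
        return 0.0
    return sum(transaction[item] * external_utility[item] for item in itemset)

def utility(itemset):
    """Utility of an itemset in the whole database."""
    return sum(utility_in_transaction(itemset, t) for t in transactions)

def transaction_utility(transaction):
    """TU: the sum of the utilities of all items in a transaction."""
    return sum(q * external_utility[i] for i, q in transaction.items())

def twu(itemset):
    """TWU: the sum of the TUs of the transactions containing the itemset."""
    return sum(transaction_utility(t) for t in transactions
               if all(item in t for item in itemset))

print(utility({"bread", "milk"}))  # (5*0.5 + 2*1) + (1*0.5 + 1*1) = 6.0
print(twu({"bread", "milk"}))      # 4.5 + 5.5 = 10.0
```

For example, with a minimum utility threshold of 5, {bread, milk} would be a high utility itemset here (utility 6.0), and its TWU of 10.0 illustrates why the TWU can safely be used for pruning: it never underestimates the utility of an itemset or of its supersets.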

Hope that this is useful. If you think that I should add more definitions, you may suggest some in the comment section below!

If you want to try high utility itemset mining, you can also check the SPMF open-source software, which offers the most complete collection of such algorithms, and also provides datasets and other resources for carrying out research on this topic.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.


My paper is rejected. Can I appeal the editor’s decision?

Today, I will talk about what to do when a paper is rejected from a journal, and more specifically whether it is possible to appeal the editor’s decision and what the chances of success are.

Generally, a researcher who sends papers to journals will sooner or later have some papers rejected. And sometimes it seems that the decision was unfair for various reasons. For instance, it may seem that reviewers have not understood the content of the paper or have been overly critical. Thus, many researchers will feel angry, disappointed or frustrated, and think that the process was unfair. Sometimes this may be justified, but sometimes not.

Thus, what should a researcher do? A possibility is to try to argue with the editor or appeal the decision. The other possibility is to move on, fix the problems with the paper and submit it to another journal.

In my experience, trying to appeal the editor’s decision has a very low chance of success unless there was clearly some serious wrongdoing, such as a reviewer submitting the review for the wrong paper. That said, different journals have different policies and procedures for handling appeals. I will now talk about this in more detail.

When to appeal a journal rejection?

First of all, it is important to carefully read the rejection letter and the reviewers’ comments. If the rejection is based on factual errors, misunderstandings, or misinterpretations of your work, you may have grounds for an appeal. However, if the rejection is based on subjective opinions, preferences, or criticisms that are not constructive or specific, you may have little chance of success.

You should also consider the type and severity of the rejection. Some journals may reject your manuscript outright, without offering any possibility of resubmission or revision. But others may reject your manuscript but allow you to make substantial changes and improvements and then to resubmit. In these cases, an appeal may not be appropriate or necessary.

How to appeal a journal rejection?

As I said previously, I personally think that appeals have a low chance of success unless there is some serious issue with the reviewing process. But if you decide to appeal a journal rejection, you should follow these steps:

  • Determine the appeal protocol (if there is one). Some journals describe an appeal process on their website. If no process is given (the more likely case), find the email of the editor-in-chief on the journal website and explain your situation. Generally, the editor is very busy, so try to explain everything in a single e-mail.
  • Always be respectful towards the editor, reviewers and other persons involved. Courtesy, humility and calm go a long way in winning a re-review. Do not be rude, aggressive, defensive, or emotional in your appeal letter. Acknowledge the editor’s and reviewers’ efforts and expertise, and thank them for their feedback.
  • Construct your argument clearly around why you believe the appeal is justified. Provide evidence and references to support your claims. Address each point of criticism or concern raised by the reviewers, and explain why you disagree or how you have resolved it. Be concise and specific, and avoid repeating yourself or making irrelevant comments.

What to expect from an appeal?

Keep in mind that appeals are rarely successful. Your appeal must be a simple and concise rational argument based on facts rather than emotions.

If your appeal is accepted, you may receive a positive response from the editor-in-chief or a senior editor who will agree to reconsider your manuscript or send it to new reviewers. However, this does not guarantee acceptance; your manuscript will still undergo another round of peer review and editorial evaluation.

If your appeal is rejected, you may receive a negative response from the editor-in-chief or a senior editor who will uphold the original decision and explain why your appeal was not convincing or valid. In this case, you should respect their final verdict and move on.

When to submit your manuscript to another journal

If your appeal is unsuccessful or not possible at all, you should look for another journal that is suitable for your manuscript. Before submitting, check the scope, aims, and requirements of the new journal. Make sure that your manuscript matches the journal’s focus and expectations, and that you follow the journal’s guidelines for authors. Also try to fix the problems in your paper that reviewers have mentioned in their reviews.

Conclusion

Receiving a rejection from a journal can be discouraging and seem unfair. Generally, you can either appeal the decision or submit your manuscript to another journal. In either case, you should be respectful in your communication with the journal. Remember that rejection is part of the process, and that it can help you improve your manuscript and increase your chances of success in the future.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.


What are Generator Itemsets?

Today, I will briefly explain what a generator itemset is. I will give some examples and explain why generator itemsets are interesting and useful for some applications. I will also mention that efficient implementations can be found in the SPMF data mining library (of which I am the founder).

What is a generator itemset?

In data mining, a popular task is frequent itemset mining. It consists of finding frequent itemsets in a database. A frequent itemset is a set of values that frequently appear together in records of a database. The classical application of itemset mining is to analyze transaction databases to study the behavior of customers in a store. In that context, each record is a customer transaction, and the goal is to find sets of products bought together by at least minsup customers where minsup is a threshold set by the user.

For example, consider the following transaction database containing five transactions:

Transaction 1 | A, B, C, D
Transaction 2 | A, B, C, D
Transaction 3 | A, C
Transaction 4 | B, C
Transaction 5 | A

where A, B, C, D represent the purchase of Apple, Bread, Cake, and Dates respectively. The support of an itemset is the number of transactions that contain it. For instance, the support of {A} is 4, the support of {B} is 3, and the support of {A, B} is 2. A frequent itemset is an itemset whose support is greater than or equal to a minimum support threshold. Let’s assume that the minimum support is 2. Then, the frequent itemsets are:

{A} support: 4
{B} support: 3
{C} support: 4
{D} support: 2
{A, B} support: 2
{A, C} support: 3
{A, D} support: 2
{B, C} support: 3
{B, D} support: 2
{C, D} support: 2
{A, B, C} support: 2
{A, B, D} support: 2
{A, C, D} support: 2
{B, C, D} support: 2
{A, B, C, D} support: 2

Among these frequent itemsets, some are generators and some are not. A (frequent) generator itemset is a frequent itemset that does not have any subset with the same support. For example, {A,B,C,D} is not a generator because it has a subset {A} with the same support (2). However, {A, B} is a generator because none of its subsets ({A}, {B}) has the same support (2). The generator itemsets in this example are:

{A} support: 4
{B} support: 3
{C} support: 4
{D} support: 2
{A, B} support: 2
{A, C} support: 3

As can be observed, the frequent generator itemsets are a subset of all frequent itemsets.
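
To illustrate the definition, here is a small brute-force Python sketch (for illustration only, not an efficient mining algorithm) that finds the generators of the example database above and reproduces the list of six generators:

```python
from itertools import combinations

transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "C"},
    {"B", "C"},
    {"A"},
]
minsup = 2

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)
            if support(frozenset(c)) >= minsup]

def is_generator(itemset):
    # A generator has no subset with the same support. Checking only the
    # subsets of size |X|-1 is sufficient because support is anti-monotone.
    return all(support(itemset - {i}) != support(itemset) for i in itemset)

for itemset in sorted(frequent, key=lambda x: (len(x), sorted(x))):
    if is_generator(itemset):
        print(sorted(itemset), "support:", support(itemset))
# Output: {A} 4, {B} 3, {C} 4, {D} 2, {A, B} 2, {A, C} 3
```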

Why are generator itemsets useful?

Generator itemsets (also called key itemsets) are useful for several reasons. First, they can be used to recover all the frequent itemsets by applying a closure operator. The closure operator takes a generator itemset and adds all the items that appear in every transaction containing that generator. For example, the closure of {A, B} is {A, B, C, D} because {A, B} appears in transactions 1 and 2, and both transactions also contain C and D. By applying the closure operator to all the generators, we can recover all the frequent itemsets.
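
The closure operator itself only takes a few lines of Python. The sketch below is self-contained (it redefines the same example database) and computes the closure of {A, B}:

```python
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "C"},
    {"B", "C"},
    {"A"},
]

def closure(itemset):
    """Intersection of all transactions that contain the itemset."""
    covering = [t for t in transactions if itemset <= t]
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return result

print(sorted(closure({"A", "B"})))  # ['A', 'B', 'C', 'D']
```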

Second, generator itemsets can help us understand the structure of the data. Generator itemsets represent the smallest sets of values that are common to some sets of transactions. In the context of shopping, these are the smallest sets of products that are common to some groups of customers. Generators are also often used for classification because they make it possible to extract the smallest sets of features that describe groups of instances (transactions).

Third, discovering generator itemsets can reduce the number of itemsets that are discovered. By mining generator itemsets instead of all frequent itemsets, we can find only the smallest itemsets that have a given support, which is more concise and can be more meaningful.

How to mine generator itemsets?

There are several algorithms that can be used to mine generator itemsets from a transactional database. One of them is Pascal, which is based on the Apriori algorithm. Pascal works by generating candidate itemsets of increasing size and checking their support in the database. However, unlike Apriori, Pascal only generates candidates that are generators or potential generators. The key property used by Pascal to reduce the search space is that if an itemset is not a generator, then all its supersets are not generators either.

Efficient implementations of algorithms

If you want to try discovering generator itemsets and other kinds of patterns, the SPMF open-source library offers more than a hundred algorithms for mining itemsets and other patterns, including generator itemsets, frequent itemsets, and closed itemsets. SPMF is easy to use and provides a graphical user interface and a command line interface.

To use SPMF to mine generator itemsets from a transactional database, we need to follow these steps:

  • Download and install SPMF from https://www.philippe-fournier-viger.com/spmf/
  • Prepare an input file in the SPMF format (a text file where each line represents a transaction and items are separated by spaces)
  • Choose an algorithm for mining generator itemsets (such as Pascal or DefMe) and set its parameters (e.g. the minimum support threshold).
  • Run the algorithm and open the output text file.
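
For example, these steps can be scripted. Assuming SPMF’s usual command-line syntax (java -jar spmf.jar run <Algorithm> <input> <output> <parameters>) and placeholder file names, running Pascal from Python could look roughly like this:

```python
import subprocess

# Run the Pascal algorithm through SPMF's command-line interface.
# File names are placeholders; spmf.jar is assumed to be in the current folder.
subprocess.run([
    "java", "-jar", "spmf.jar", "run", "Pascal",
    "input.txt",   # one transaction per line, items separated by spaces
    "output.txt",  # will contain the generator itemsets with their support
    "50%",         # minimum support threshold
], check=True)
```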

For more details and examples on how to use SPMF to mine generator itemsets or other patterns from data, please refer to the documentation and tutorials on https://www.philippe-fournier-viger.com/spmf/

A video

Lastly, if you are interested in this topic, I have also uploaded a video where I explain in more detail what generator itemsets, closed itemsets and maximal itemsets are. This video is part of my free pattern mining course, and can be watched here: Maximal, Closed and Generator itemsets or from my YouTube channel here.

Conclusion

In this blog post, I have explained what a generator itemset is and why it is useful for data analysis. Hope that it was interesting. 😉

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.


How to find cost-effective patterns in data?

Have you ever wondered how to find patterns in data that are not only frequent but also profitable and cost-effective? For example, if you are an online retailer, you may want to know which products are often bought together by customers and generate high profits but require low costs (such as shipping fees or discounts). Or if you are an educator, you may want to know which learning activities are frequently performed by students and result in high grades but require low effort (such as time or resources).


In this blog post, I will briefly introduce a new problem called low-cost high utility itemset mining (LCHUIM), which aims to find such patterns. LCHUIM is a generalization of the well-known problem of high utility itemset mining (HUIM), which focuses on finding patterns that have high utilities (benefits) according to a user-defined utility function. However, HUIM ignores the cost associated with these patterns, such as time, money or other resources that are consumed. The novel problem of LCHUIM addresses this limitation by considering both the utility and the cost of patterns.

To be more precise, a low-cost high utility itemset is an itemset (a set of values) that must satisfy three criteria:

  • Its average utility must be no less than a min_utility threshold set by the user.
  • Its average cost must be no greater than a max_cost threshold set by the user.
  • Its support (occurrence frequency) must be no less than a minsup threshold set by the user.

For example, suppose we have a transaction database that contains information about customers’ purchases and their profits and costs. A possible low-cost high utility itemset could be {bread, cheese}, which means that customers often buy bread and cheese together, and this combination generates high profits but requires low costs.
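
To make the three criteria above concrete, here is a minimal Python sketch that checks them for a single itemset, given made-up utility and cost values for each transaction where it occurs (this only illustrates the pattern test, not the LCIM mining algorithm itself):

```python
# Hypothetical occurrences of one itemset, e.g. {bread, cheese}:
# one (utility, cost) pair per transaction where the itemset appears
occurrences = [(6.0, 1.0), (8.0, 2.0), (5.0, 1.5)]

min_utility = 5.0  # minimum average utility threshold
max_cost = 2.0     # maximum average cost threshold
minsup = 2         # minimum support (occurrence frequency)

def is_low_cost_high_utility(occurrences):
    support = len(occurrences)
    if support < minsup:
        return False
    avg_utility = sum(u for u, _ in occurrences) / support
    avg_cost = sum(c for _, c in occurrences) / support
    return avg_utility >= min_utility and avg_cost <= max_cost

print(is_low_cost_high_utility(occurrences))
# True: support = 3, average utility ~6.33 >= 5.0, average cost 1.5 <= 2.0
```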

Finding low-cost high utility itemsets can reveal interesting insights for various domains and applications. For instance, online retailers could use LCHUIM to design effective marketing strategies or recommend products to customers. Educators can use LCHUIM to analyze students’ learning behaviors and provide personalized feedback or guidance.

To solve the problem of LCHUIM, an algorithm named LCIM (Low Cost Itemset Miner) was proposed. The algorithm uses a novel lower bound on the average cost called Average Cost Bound (ACB) to reduce the search space of possible patterns. The algorithm also employs several techniques such as prefix-based partitioning, depth-first search and subtree pruning to speed up the mining process.

The LCIM algorithm was published in this paper:

Nawaz, M. S., Fournier-Viger, P., Alhusaini, N., He, Y., Wu, Y. and Bhattacharya, D. (2022). LCIM: Mining Low Cost High Utility Itemsets. Proc. of the 15th Multi-disciplinary International Conference on Artificial Intelligence (MIWAI 2022), pp. 73-85, Springer LNAI. [ppt][source code and data]

The code and datasets of LCIM can be found in the SPMF open-source data mining library at: http://philippe-fournier-viger.com/spmf/. SPMF is an efficient open-source data mining library that contains more than 250 algorithms for various pattern mining tasks such as frequent itemset mining, sequential pattern mining, association rule mining and many more.

By the way, the concept of finding cost-effective patterns was also studied for the case of discovering sequential patterns with some algorithms called CorCEPB, CEPN, and CEPB (see this other blog post for more details).

I hope you enjoyed this short blog post and learned something new about low-cost high utility itemset mining. If you have any questions or comments, feel free to leave them below. Thank you for reading!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.


What is a Closed Itemset and Why is it Useful?

In this blog post, I will explain in simple terms what a closed itemset is and give some examples. I will also mention a few algorithms that can be used to find closed itemsets, whose implementations can be found in the SPMF data mining library in Java.

Frequent itemset mining is a popular technique for discovering interesting relationships between items in data represented as a table. The classical application of frequent itemset mining is to analyze data from a transactional database to find correlations between the items that customers purchase.

An itemset is a set of items that appear together in a transaction or a record. For example, if we have a database of customer transactions, an itemset could be {bread, butter, jam}, which means that some customers have bought bread, butter and jam together.

A frequent itemset is an itemset that appears frequently in a database, that is, its occurrence is above a minimum threshold (minsup) set by the user. For example, if minsup is set to 10%, then a frequent itemset is an itemset that appears in at least 10% of the transactions made by customers. Frequent itemsets have many applications. In the context of analyzing customer transactions, they can be used for marketing, recommendation systems, or cross-selling.

Though frequent itemset mining has many applications, a problem is that it can generate a very large number of frequent itemsets, especially when the database contains many items and transactions. This can make it difficult for users to analyze and interpret the results, and also consume a lot of memory and computational resources.

To address this problem, one can use the concept of frequent closed itemset. A frequent closed itemset is a frequent itemset that has no frequent superset with the same support (appear in the same number of transactions). For example, if {bread, milk} has a support of 15% and {bread, milk, cheese} has a support of 15% as well, then {bread, milk} is not a frequent closed itemset because it has a frequent superset with the same support.

Let’s take an example. Suppose we have a database with four customer transactions, denoted as T1, T2, T3 and T4:

T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}

where the letters a, b, c, d indicate the purchase of the items apple, bread, cake and dates.

If we set the minimum support threshold to 50% (which means that we want to find itemsets appearing in at least two transactions), a frequent itemset mining algorithm will output the following frequent itemsets:

{a}, {b}, {c}, {d}, {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {a,b,c}, {a,b,d}

But among these frequent itemsets, only three are frequent closed itemsets:

{a,b}, {a,b,c}, {a,b,d}

The reason is that these itemsets have no frequent supersets with the same support. For example, {a,b} has a support of 100%, and none of its frequent supersets ({a,b,c} and {a,b,d}) has the same support. On the other hand, {a,c} is not closed because it has a superset ({a,b,c}) with the same support (50%). Note that {a,b,c,d} is not in the list because its support (25%) is below the minimum threshold, so it is not frequent.
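
As a sanity check, here is a small brute-force Python sketch (for illustration only; the dedicated algorithms mentioned below are far more efficient) that computes the frequent closed itemsets of this example:

```python
from itertools import combinations

transactions = [
    {"a", "b", "c", "d"},  # T1
    {"a", "b", "c"},       # T2
    {"a", "b", "d"},       # T3
    {"a", "b"},            # T4
]
minsup = 2  # 50% of the 4 transactions

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsup]

# A frequent itemset is closed if no frequent superset has the same support
closed = [x for x in frequent
          if not any(x < y and support(x) == support(y) for y in frequent)]

for x in sorted(closed, key=lambda s: (len(s), sorted(s))):
    print(sorted(x), "support:", support(x))
# Output: ['a', 'b'] 4, then ['a', 'b', 'c'] 2 and ['a', 'b', 'd'] 2
```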

Why are closed itemsets useful?

Closed itemsets are important because they can reduce the number of frequent itemsets presented to the user, without losing any information. Frequent itemsets can be very numerous and redundant, especially when the minsup is low or the database is dense. By mining closed itemsets, generally only a very small set of itemsets is obtained, and still all the other frequent itemsets can be directly derived from the closed itemsets. In other words, any frequent itemset can be derived from a closed itemset by removing some items. For example, we can obtain {a} from {a,b} by removing b, or we can obtain {b,d} from {a,b,d} by removing a. Therefore, by using closed itemsets, we can reduce the number of itemsets presented to the user without missing any important information.

How to find closed itemsets?

There are several algorithms that have been proposed for this task, such as Apriori-Close, CHARM, FP-Close and LCM. These algorithms are based on different strategies, such as pruning, merging, prefix trees or bitmap representations, to efficiently find all closed itemsets without missing any or generating any false positives. Some of these algorithms have also been adapted to mine maximal itemsets, which are closed itemsets that have no frequent supersets at all.

If you want to try these algorithms on your own data, you can use the SPMF data mining library in Java. SPMF is an open-source library that provides very efficient implementations of various data mining algorithms, including those for closed itemset mining. You can download SPMF from its website (http://www.philippe-fournier-viger.com/spmf/) and follow the documentation and examples to run the algorithms on your data. You can also use SPMF as a command line tool or integrate it with other Java applications, or with programs written in other programming languages such as C#, Python and R (see the website).

A video

If you are interested in this topic, you can also watch a 50 minute video where I explain in more detail what closed itemsets are and their properties, as well as maximal itemsets and generator itemsets. This video is part of my free pattern mining course, and you can watch it here: Maximal, Closed and Generator itemsets or from my YouTube channel here.

Hope that this blog post has been interesting 🙂

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.


Introducing PSAC-PDB: A Novel Approach to Protein Structure Analysis and Classification

In this blog post, I will give a short overview of our recent research paper about protein structure analysis and classification by my team (M. S. Nawaz, P. Fournier-Viger, Y. He and Q. Zhang).

Proteins are essential molecules for life, as they perform a variety of functions in living organisms, such as catalysis, transport, signaling, defense, and regulation. The structure of a protein determines its function and interactions with other molecules. Therefore, understanding the structure of proteins is crucial for advancing biological and medical research. In biophysics and computational biology, predicting the 3D structure of a protein from its amino acid sequence is an important research problem.

However, protein structure analysis and classification is a challenging task, as proteins have complex and diverse shapes that can be affected by various factors, such as environment, mutations, and interactions. Moreover, the number of protein structures available in databases such as PDB and EMDB is increasing rapidly, making it difficult to manually annotate and compare them.

Thus, we recently published a research paper, which proposes a novel method for protein structure analysis and classification. The paper is titled “PSAC-PDB: Analysis and Classification of Protein Structures” and it will appear in the journal Computers in Biology and Medicine, published by Elsevier. The reference of the paper is:

Nawaz, S. M., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier, Volume 158, May 2023, 106814

The main contribution of the paper is to propose a new framework called PSAC-PDB (Protein Structure Analysis and Classification using Protein Data Bank).

Briefly, PSAC-PDB is a computational method that uses a protein structure comparison tool (DALI) to find similar protein structures to a query structure in PDB, and then uses amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns to perform classification. PSAC-PDB applies eleven classifiers and compares their performance using six evaluation metrics.

PSAC-PDB also uses sequential pattern mining (SPM) to discover frequent amino acid patterns that can improve the classification accuracy. SPM is a data mining technique that finds subsequences that appear frequently in a set of sequences. SPM can capture the sequential ordering and co-occurrence of amino acids in protein sequences, which can reflect their structural and functional properties. In the paper, frequent sequential patterns of amino acids (AA) are extracted by the TKS and CM-SPAM algorithms for three families: the S protein structures of SARS-CoV-2 (SSC2), the S protein structures of other viruses and organisms (SO), and other protein (enzyme) structures (O).

PSAC-PDB was tested on the case study of SARS-CoV-2 spike protein structures, which are responsible for the entry of the virus into host cells. PSAC-PDB finds 388 similar protein structures to the query structure in PDB, and divides them into three families: S protein structures of SARS-CoV-2, S protein structures of other viruses and organisms, and other protein structures. PSAC-PDB uses four datasets based on amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns for classification.

Results have shown that PSAC-PDB achieves higher accuracy, precision, recall, F1-score, MCC and AUC values for all three families of protein structures than state-of-the-art approaches for genome sequence classification.

But what is also interesting is that PSAC-PDB shows that using frequent amino acid patterns or aligned amino acids can improve the classification performance compared to using only amino acid sequences or aligned secondary structure elements. Thus, PSAC-PDB can benefit the research community in structural biology and bioinformatics.

For more information, please see the paper. Besides, the datasets from this paper can also be found on Github: https://github.com/saqibdola/PSAC-PDB and implementations of sequential pattern mining algorithms used in the paper can be found in the SPMF data mining software.


DSSBA 2023, 2nd Special Session on Data Science for Social and Behavioral Analytics @DSAA

I am excited to announce that the 2nd Special Session on Data Science for Social and Behavioral Analytics (DSSBA 2023) will be held at the 10th IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2023) from October 9-13, 2023.

DSSBA 2023 is a forum for researchers and practitioners to present and discuss the latest advances on the design of efficient, scalable and effective solutions for analyzing social and behavioral data. Social and behavioral data are collected from various sources such as social networks, e-learning systems, e-commerce platforms, health care systems, etc. Analyzing these data can help us gain a better understanding of human behavior and social interactions, which can support decision making, personalized recommendation, behavior prediction, behavior change, etc.

The topics of interest for DSSBA 2023 include, but are not limited to:

  • Efficient and scalable algorithms for behavioral and social analytics
  • Evaluation of behavioral analytic models
  • Privacy-preserving techniques for behavioral and social analytics
  • Cognitive and social aspects of behavior analysis
  • Intelligent systems and services powered by behavioral and social data models
  • Applications of behavioral and social analytics in various domains such as education, health care, business, etc.
  • Pattern mining and machine learning models, etc.

We invite you to submit your original research papers to DSSBA 2023 by June 1, 2023. The submission guidelines and more information can be found on the DSSBA 2023 website. The accepted papers will be published in the DSAA 2023 proceedings.

We look forward to receiving your submissions and seeing you at DSSBA 2023!


Discovering the Top-K Stable Periodic Patterns in a Sequence of Events

In this blog post, I will give a brief introduction to TSPIN, an algorithm to find stable periodic patterns in a sequence of events. This algorithm was presented in this research paper:

Fournier-Viger, P., Wang, Y., Yang, P., Lin, J. C.-W., Yun, U. (2021). TSPIN: Mining Top-k Stable Periodic Patterns. Applied Intelligence. [source code & data]

What is a periodic pattern? Periodic patterns are sets of values or symbols that appear repeatedly in a sequence of events at more or less regular intervals. For example, in a customer transaction database, a periodic pattern may indicate that some customers buy milk every week or bread every two days. Discovering periodic patterns can thus help to understand customer behavior, optimize inventory management, forecast sales, and so on. For instance, a periodic pattern {a,c}, where a is for apple and c is for cake, may appear every two hours in a sequence of events.

The idea of finding periodic patterns is interesting as it can provide insights about the data. However, a major problem is that traditional algorithms for periodic pattern mining have two important limitations:

First, they use a very strict definition of periodicity that requires a pattern to always appear again within a fixed time limit. As a result, some meaningful patterns may be discarded if they are slightly irregular. For example, if a user wants to find periodic patterns that appear weekly, a traditional algorithm will discard a pattern if it does not appear for a week even though it may appear weekly for the rest of the year.

Second, several algorithms use a minimum support threshold to filter out infrequent patterns. This parameter is useful, but generally users don’t know how to set it and thus adjust it by trial and error to find a suitable value.

To overcome these limitations, the TSPIN (Top-k Stable Periodic pattern mINer) algorithm was designed. It combines two key ideas:

1) Stability: A measure of stability is used in TSPIN to evaluate how consistent a pattern is in terms of its periodicity. A stable pattern has relatively small variations in its occurrence intervals, while an unstable pattern has large variations. TSPIN finds stable periodic patterns but it is designed to be less strict than traditional algorithms so as to also find patterns that are slightly irregular.

2) Top-k: A way of ranking patterns based on their frequency and selecting only the k most frequent ones. This avoids the need to specify a minimum support threshold.
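
To give a rough intuition of the first idea, one can look at how much the periods (gaps between consecutive occurrences) of a pattern vary. The small Python sketch below uses made-up timestamps and the standard deviation of the periods as a simple stability proxy; note that TSPIN’s actual stability measure is defined differently in the paper:

```python
from statistics import pstdev

# Hypothetical timestamps (e.g., in hours) at which a pattern occurs
occurrences = [1, 3, 5, 8, 10, 12]

# Periods: gaps between consecutive occurrences
periods = [b - a for a, b in zip(occurrences, occurrences[1:])]
print(periods)          # [2, 2, 3, 2, 2]

# A small standard deviation means the pattern reappears at
# near-regular intervals, i.e., it is stable
print(pstdev(periods))  # 0.4 -> quite stable
```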

The TSPIN algorithm is an efficient algorithm that integrates several optimizations, which are explained in the TSPIN research paper. That paper also describes several experiments on synthetic and real datasets. The results show that TSPIN has good performance and can reveal interesting patterns in real-life data.

There are several potential applications of TSPIN. For example, applying TSPIN on a dataset of web clickstream data may reveal that some users visit certain websites periodically with high stability, such as news portals or social media platforms. This information could then be used for personalized recommendation or advertising.

To try TSPIN, the code and datasets are available in the SPMF pattern mining library, a popular Java software for pattern mining that can also be used from other programming languages. SPMF also offers implementations of a dozen other algorithms for discovering periodic patterns and hundreds of algorithms to find other types of patterns such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.

Conclusion

In conclusion, TSPIN is an innovative algorithm for mining top-k stable periodic patterns in discrete sequences. It addresses some limitations of traditional algorithms by using a more flexible definition of periodicity and avoiding parameter tuning. It can discover meaningful patterns that reflect regular behaviors or habits in various domains.

Hope this blog post has been interesting. If you like this topic, you may also check my list of key papers on periodic pattern mining on this blog.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.


Data disasters and more…

Today, I will not talk about a serious topic. Recently, I have been experimenting a little bit with generative AI tools for pictures during my free time. Some services can generate some pretty nice pictures based on text prompts. For fun, I have combined the concept of “data” with different types of disasters such as “volcano”, “tsunami”, “thunderstorm” and other concepts. Here are a few generated pictures.

Data tsunami

Data volcano

Data thunderstorm

Data tornado

Data mountain

Data vortex

Data waterfall

Interesting?

Actually, it did not take much time to generate these pictures and it is possible to generate many other variations by using the same prompts. But if you use the above pictures as they are, please give credit to this blog and link to this page.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.


UDML 2023 workshop!

I’m excited to announce that I am co-organizing a new edition of the Utility-Driven Mining and Learning (UDML) workshop, which will be colocated with the IEEE International Conference on Data Mining (ICDM) 2023 in Shanghai, China.

The UDML workshop aims to bring together researchers and practitioners who are interested in developing and applying utility-driven methods for data mining and machine learning. Utility-driven methods are those that consider not only the accuracy or interestingness of the results, but also their usefulness or value for specific applications or users. For example, utility-driven methods can take into account user preferences, constraints, costs, benefits, risks, or trade-offs when mining or learning from data.

The UDML workshop will feature an invited talk by a leading expert in the field, as well as contributed papers that showcase novel research ideas, methods, systems, applications, and challenges related to utility-driven mining and learning. The workshop will also provide a platform for networking and discussion among researchers and practitioners who share a common interest in this topic.

If you are working on or interested in utility-driven mining and learning, I invite you to submit your paper or poster to the UDML workshop by September 1st 2023. The submission guidelines and topics of interest can be found on the workshop website: https://www.philippe-fournier-viger.com/utility_mining_workshop_2023/

I also encourage you to attend the UDML workshop on December 4th 2023 at ICDM 2023 in Shanghai. You will have the opportunity to learn from the presentations, interact with the authors of the accepted papers and posters, and network with other participants.

I hope to see you at UDML 2023!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.
