What are the applications of sequential pattern mining?

Sequential pattern mining (SPM) is a data mining technique that aims to discover patterns or subsequences that occur frequently in sequential data. Sequential data can be found in various domains such as customer transactions, web browsing, text analysis, and bioinformatics. SPM can help reveal interesting and useful insights from such data, such as customer behavior, website navigation, text structure, and gene expression.

Because I often receive questions about the applications of sequential pattern mining, I will discuss this topic in this blog post. I will present several example applications of sequential pattern mining, including malware detection, COVID-19 analysis, e-learning, and point-of-interest recommendation. I will briefly explain how SPM can be used for these tasks and cite some research papers that have applied SPM for these purposes.

Note: The list of applications on this page is far from exhaustive. There are hundreds of applications, and I do not cite all the papers, as there are too many. I mostly give some of my own papers as examples because I am familiar with their content.

Sequential Pattern Mining for Malware Detection

Malware is malicious software that can harm computers and networks by stealing data, encrypting files, disrupting services, or spreading viruses. Malware detection is the task of identifying and classifying malware based on its behavior and characteristics. One of the challenges of malware detection is that malware can change its appearance or code to evade detection, using techniques such as obfuscation, packing, encryption, or metamorphism.

One way to overcome this challenge is to analyze the behavior of malware during execution, rather than their static code. A common way to represent the behavior of malware is by using the Application Programming Interface (API) calls that they make to interact with the operating system or other applications. API calls can reflect the actions and intentions of malware, such as creating files, modifying registry keys, sending network packets, or injecting code.

Sequential pattern mining can be used to discover frequent patterns of API calls made by different malware families or variants. These patterns can then be used as features for malware classification or clustering. For example, Nawaz et al. (2022) proposed MalSPM, an intelligent malware detection system based on sequential pattern mining for the analysis and classification of metamorphic malware behavior. They used a dataset that contains sequences of API calls made by different malware on Windows OS and applied SPM algorithms to find frequent API calls and their patterns. They also discovered sequential rules between API call patterns as well as maximal and closed frequent API call patterns. They then used these patterns as features for seven classifiers and compared their performance with existing state-of-the-art malware detection approaches. They showed that MalSPM outperformed these approaches in terms of accuracy, precision, recall, F1-score, and ROC-AUC.
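To give a concrete idea of the kind of computation involved, here is a minimal sketch (not the actual MalSPM algorithm) that counts frequent ordered pairs of API calls across execution traces; the API call names and traces are made up for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_subsequences(traces, k=2, minsup=2):
    """Count k-item subsequences (order preserved, gaps allowed)
    and keep those appearing in at least minsup traces."""
    counts = Counter()
    for trace in traces:
        # count each distinct subsequence once per trace (its support)
        counts.update(set(combinations(trace, k)))
    return {pat: sup for pat, sup in counts.items() if sup >= minsup}

# Hypothetical API-call traces recorded from sandboxed executions
traces = [
    ["OpenFile", "WriteFile", "RegSetValue", "Connect"],
    ["OpenFile", "RegSetValue", "Connect", "Send"],
    ["LoadLibrary", "OpenFile", "WriteFile", "Connect"],
]
patterns = frequent_subsequences(traces, k=2, minsup=3)
print(patterns)  # {('OpenFile', 'Connect'): 3}
```

In a real system, such frequent patterns would then serve as binary features ("does this trace contain the pattern?") fed to a classifier.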

Research papers:
Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier.

Blog post:
How to Detect and Classify Metamorphic Malware with Sequential Pattern Mining (MalSPM)

Research paper on related topic of botnet detection:
Daneshgar FF, Abbaspour M. A two-phase sequential pattern mining framework to detect stealthy P2P botnets. Journal of Information Security and Applications. 2020 Dec 1;55:102645.

Sequential Pattern Mining for Healthcare

COVID-19 is a novel coronavirus disease that emerged in late 2019 and has since spread to more than 200 countries, causing more than 5 million deaths worldwide. COVID-19 genome sequencing is critical to understand the behavior of the virus, its origin, and how fast it mutates, and to develop drugs/vaccines and effective preventive strategies.

SPM can be used to learn interesting information from COVID-19 genome sequences, such as frequent patterns of nucleotide bases and their relationships with each other. These patterns can help examine the evolution and variations in COVID-19 strains and identify potential mutations that may affect the virus's infectivity or virulence. For example, Nawaz et al. (2021) applied SPM on a corpus of COVID-19 genome sequences to find interesting hidden patterns. They also applied sequence prediction models to evaluate whether nucleotide bases can be predicted from previous ones. Moreover, they designed an algorithm to find the locations in the genome sequences where nucleotide bases are changed and to calculate the mutation rate. They suggested that SPM and mutation analysis techniques can reveal interesting information and patterns in COVID-19 genome sequences.
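To illustrate the mutation analysis part, here is a simplified sketch (not the algorithm from the paper) that finds the positions where two aligned genome sequences differ and computes a mutation rate; it assumes the sequences are already aligned and of equal length:

```python
def mutation_positions(reference, variant):
    """Return the 0-based positions where two aligned sequences differ.
    Assumes both sequences are aligned and of equal length."""
    return [i for i, (a, b) in enumerate(zip(reference, variant)) if a != b]

def mutation_rate(reference, variant):
    """Fraction of positions whose nucleotide base changed."""
    return len(mutation_positions(reference, variant)) / len(reference)

# Toy example with short made-up sequences
ref = "ATGCCGTAAC"
var = "ATGACGTGAC"
print(mutation_positions(ref, var))  # [3, 7]
print(mutation_rate(ref, var))       # 0.2
```

Real genome sequences are of course much longer and require an alignment step first, but the core position-by-position comparison is the same.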

Research papers:

Nawaz, S., Fournier-Viger, P., Aslam, M., Li, W., He, Y., Niu, X. (2023). Using Alignment-Free and Pattern Mining Methods for SARS-CoV-2 Genome Analysis. Applied Intelligence.

Nawaz, S. M., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier.

Blog post: Analyzing the COVID-19 genome with AI and data mining techniques (paper + data + code)

Another application of SPM in healthcare is to discover sequential patterns from clinical records of patients, for example those diagnosed with COVID-19 or at risk of developing it. These patterns can help identify risk factors, symptoms, comorbidities, treatments, outcomes, or complications. For example, Cheng et al. (2017) proposed such a framework for the early assessment of chronic diseases by mining sequential risk patterns with time interval information from diagnostic clinical records, using SPM and classification modeling techniques.

Research paper:

Cheng, Y. T., Lin, Y. F., Chiang, K. H., Tseng, V. S. (2017). Mining Sequential Risk Patterns From Large-Scale Clinical Databases for Early Assessment of Chronic Diseases: A Case Study on Chronic Obstructive Pulmonary Disease. IEEE Journal of Biomedical and Health Informatics, 21(2), 303-311.

Related papers:
Ou-Yang, C., Chou, S. C., Juan, Y. C., & Wang, H. C. (2019). Mining Sequential Patterns of Diseases Contracted and Medications Prescribed before the Development of Stevens-Johnson Syndrome in Taiwan. Applied Sciences, 9(12), 243

Lee, E. W., & Ho, J. C. (2019). FuzzyGap: Sequential Pattern Mining for Predicting Chronic Heart Failure in Clinical Pathways. AMIA Summits on Translational Science Proceedings, 2019, 222.

Sequential pattern mining for E-learning

E-learning is the use of electronic media and information technologies to deliver education and training. E-learning systems have undergone rapid growth in the current decade and have generated a tremendous amount of e-learning resources that are highly heterogeneous and in various media formats. This information overload has led to the need for personalization in e-learning environments, to provide more relevant and suitable learning resources to learners according to their individual needs, objectives, preferences, characteristics, and learning styles.

SPM can be used to discover sequential patterns of learners’ behavior and interactions with e-learning systems, such as the sequences of learning resources accessed, the sequences of learning activities performed, or the sequences of learning outcomes achieved. These patterns can help understand the learning processes and preferences of learners, identify their strengths and weaknesses, provide feedback and guidance, recommend appropriate learning resources or activities, predict their performance or satisfaction, and improve the design and evaluation of e-learning systems.
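As a practical note, before applying SPM algorithms such as those offered in SPMF, learner interaction logs must typically be encoded as sequences. Here is a small sketch that encodes sessions in the SPMF sequence format (itemsets separated by -1, each sequence terminated by -2); the numeric event codes for learning activities are hypothetical:

```python
def to_spmf(sequences):
    """Encode sequences of itemsets as lines in SPMF sequence format:
    itemsets separated by -1, each sequence terminated by -2."""
    lines = []
    for seq in sequences:
        parts = []
        for itemset in seq:
            parts.extend(str(item) for item in itemset)
            parts.append("-1")
        parts.append("-2")
        lines.append(" ".join(parts))
    return lines

# Hypothetical learner sessions: event codes for activities performed
# (1 = watch video, 2 = take quiz, 3 = post on forum, 4 = submit assignment)
sessions = [[[1], [2], [4]], [[1], [3], [2]], [[1, 3], [4]]]
for line in to_spmf(sessions):
    print(line)
# 1 -1 2 -1 4 -1 -2
# 1 -1 3 -1 2 -1 -2
# 1 3 -1 4 -1 -2
```

The resulting file can then be given as input to sequential pattern mining algorithms to find, for instance, that many learners watch a video before taking a quiz.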

There are many papers related to this topic. I wrote some of the earliest ones during my Ph.D. studies (in 2008).

Research papers:

Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence, to appear.
DOI : 10.1007/s10489-021-03122-7

Fournier-Viger, P., Nkambou, R., Mayers, A. (2008). Evaluating Spatial Representations and Skills in a Simulator-Based Tutoring System. IEEE Transactions on Learning Technologies (TLT), 1(1):63-74.
DOI: 10.1109/TLT.2008.13

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican Intern. Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Yau, T. S. Mining Sequential Patterns of Students’ Access on Learning Management System. Data Mining and Big Data, 191.

王文芳 (Wang Wenfang) (2016). Learning Process Analysis Based on Sequential Pattern Mining in a Web-based Inquiry Science Environment.

Rashid, A., Asif, S., Butt, N. A., Ashraf, I. (2013). Feature Level Opinion Mining of Educational Student Feedback Data using Sequential Pattern Mining and Association Rule Mining. International Journal of Computer Applications, 81.

Point-of-Interest Recommendation

Point-of-interest (POI) recommendation is the task of recommending places or locations that users may be interested in visiting based on their preferences, context, and history. POI recommendation has many applications such as tourism, navigation, social networking, advertising, and urban planning. POI recommendation can help users discover new places that match their interests and needs, save time and effort in searching for information, enhance their travel experience and satisfaction, and increase their loyalty and engagement with service providers.

SPM can be used to discover sequential patterns of users’ mobility behavior and preferences, such as the sequences of POIs visited by users within a day or a trip. These patterns can help understand the travel patterns and preferences of users, identify their travel purposes and intentions, provide feedback and guidance, recommend appropriate POIs or routes based on their travel history or context, predict their next destination or activity, and improve the design and evaluation of POI recommendation systems.
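As a toy illustration of next-destination prediction, here is a sketch that learns POI-to-POI transition counts from visit sequences and predicts the most frequent successor; the POI names and trips are invented for the example:

```python
from collections import Counter, defaultdict

def build_transitions(sequences):
    """Count POI -> next-POI transitions over users' visit histories."""
    trans = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            trans[cur][nxt] += 1
    return trans

def predict_next(trans, current_poi):
    """Most frequent successor of the current POI, or None if unseen."""
    successors = trans.get(current_poi)
    if not successors:
        return None
    return successors.most_common(1)[0][0]

# Hypothetical check-in sequences (one list per user trip)
trips = [
    ["museum", "cafe", "park"],
    ["museum", "cafe", "mall"],
    ["hotel", "museum", "cafe", "park"],
]
trans = build_transitions(trips)
print(predict_next(trans, "cafe"))  # park (2 of 3 observed transitions)
```

This first-order model only looks at the current POI; sequential pattern mining generalizes this by finding longer frequent subsequences, possibly with gaps, across many users.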

Research papers:

Amirat, H., Lagraa, N., Fournier-Viger, P., Ouinten, Y., Kherfi, M. L., Guellouma, Y. (2022). Incremental Tree-based Successive POI Recommendation in Location-based Social Networks. Applied Intelligence, to appear.

Lee, G. H., & Han, H. S. (2019). Clustering of tourist routes for individual tourists using sequential pattern mining. The Journal of Supercomputing, 1-18.

Yang, C., & Gidófalvi, G. (2017). Mining and visual exploration of closed contiguous sequential patterns in trajectories. International Journal of Geographical Information Science, 1-23.

Some other applications

Theorem proving

Nawaz, M. S., Sun, M., & Fournier-Viger, P. (2019, May). Proof guidance in PVS with sequential pattern mining. In International Conference on Fundamentals of Software Engineering (pp. 45-60). Springer, Cham.

Fire spot identification

Istiqomah, N. (2018). Fire Spot Identification Based on Hotspot Sequential Pattern and Burned Area Classification. BIOTROPIA-The Southeast Asian Journal of Tropical Biology, 25(3), 147-155.

Sentiment analysis
Rana, T. A., & Cheah, Y. N. (2018). Sequential patterns-based rules for aspect-based sentiment analysis. Advanced Science Letters, 24(2), 1370-1374.

Pollution analysis
Yang, G., Huang, J., & Li, X. (2017). Mining Sequential Patterns of PM 2.5 Pollution in Three Zones in China. Journal of Cleaner Production.

Anomaly detection
Rahman, Anisur, et al. "Finding Anomalies in SCADA Logs Using Rare Sequential Pattern Mining." International Conference on Network and System Security. Springer International Publishing, 2016.

Authorship attribution
Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

Video games
Leece, M., Jhala, A. (2014). Sequential Pattern Mining in StarCraft: Brood War for Short and Long-Term Goals. Proc. Workshop on Adversarial Real-Time Strategy Games at AIIDE Conference. AAAI Press.

Restaurant recommendation
Han, M., Wang, Z., Yuan, J. (2013). Mining Constraint Based Sequential Patterns and Rules on Restaurant Recommendation System. Journal of Computational Information Systems, 9(10): 3901-3908.

Conclusion

In this blog post, I have introduced some representative applications of sequential pattern mining. Hope that this has been interesting. If you want to know more about sequential pattern mining, you can check my survey of sequential pattern mining and my video introduction to sequential pattern mining (24 minutes). Also, you can check out my free online pattern mining course and the SPMF open-source software, which offers hundreds of pattern mining algorithms.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.


Test your knowledge about sequential pattern mining!

Sequential pattern mining is a data mining technique used to discover frequent sequences or patterns in data. I have prepared a short test of 10 questions to evaluate your knowledge of sequential pattern mining. The test is not very hard. It is just for fun.

So are you ready to test your knowledge about sequential pattern mining?

Here are 10 questions:

Questions

  1. What is the main goal of sequential pattern mining?
  2. What is the difference between sequential pattern mining and association rule mining?
  3. What are some common applications of sequential pattern mining?
  4. What is the minimum support threshold in sequential pattern mining?
  5. What is the difference between a sequence database and a transaction database?
  6. What is the difference between a sequence and a subsequence?
  7. What is the difference between a maximal and a closed sequential pattern?
  8. What is the difference between sequence and item constraints in sequential pattern mining?
  9. What is the difference between vertical and horizontal data formats in sequential pattern mining?
  10. What are some challenges in sequential pattern mining?

Answers

  1. The main goal of sequential pattern mining is to discover frequent sequences or patterns in a sequence database (a set of sequences).
  2. Association rule mining focuses on finding strong associations between items within transactions. It does not consider the time dimension or the sequential ordering between items. On the other hand, sequential pattern mining focuses on finding relationships between items across transactions over time. More precisely, sequential pattern mining aims at finding subsequences that appear frequently in a set of sequences.
  3. Some common applications of sequential pattern mining include market basket analysis, web usage mining, authorship attribution, malware detection, and bioinformatics.
  4. The minimum support threshold is a user-defined parameter that specifies the minimum number of times a sequence must appear in the data to be considered frequent.
  5. A sequence database contains sequences of ordered events or items, while a transaction database contains unordered sets of items.
  6. A sequence is an ordered list of events or items, while a subsequence is a subset of events or items from a sequence that maintains their relative order.
  7. A maximal sequential pattern is a frequent sequence that is not a subsequence of any other frequent sequence, while a closed sequential pattern is a frequent sequence that has no frequent super-sequence with the same support.
  8. Sequence constraints specify conditions on the order and timing of events or items within a sequence, while item constraints specify conditions on the presence or absence of specific items within a sequence.
  9. In the horizontal data format, each row is a sequence listing its itemsets in order, while in the vertical data format, each row corresponds to an item and indicates the sequences and positions in which that item appears.
  10. Some challenges in sequential pattern mining include handling large datasets, dealing with noise and missing data, improving the runtime and memory consumption, and incorporating constraints. There are of course, many other challenges that could be mentioned.
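Several of these answers (in particular 1, 4 and 6) can be illustrated with a few lines of code. The sketch below checks whether a pattern is a subsequence of a sequence (relative order preserved, gaps allowed) and computes its support in a small sequence database; the database is a toy example:

```python
def is_subsequence(sub, seq):
    """Check whether sub appears in seq with relative order preserved
    (gaps between matched items are allowed)."""
    it = iter(seq)
    return all(item in it for item in sub)

def support(pattern, database):
    """Number of sequences in the database containing the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

db = [["a", "b", "c", "d"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
print(support(["a", "c"], db))  # 2
print(support(["b", "d"], db))  # 3
```

With a minimum support threshold of 3, the pattern ["b", "d"] would be frequent while ["a", "c"] would not.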

Conclusion

How did you do on the quiz? 😊 Let me know in the comment section below!



Test your knowledge about pattern mining!

Today, I publish a small quiz to test your knowledge about pattern mining. There are 10 questions, which are then followed by the answers. Those questions are not very hard 😉 How many can you get right?

Questions

  1. What is the goal of pattern mining?
  2. What is the difference between frequent pattern mining and sequential pattern mining?
  3. What is the Apriori algorithm used for?
  4. What is the difference between association rule mining and correlation analysis?
  5. What is the minimum support threshold in frequent pattern mining?
  6. What is the minimum confidence threshold in association rule mining?
  7. What is a closed pattern?
  8. What is the difference between itemset mining and sequence mining?
  9. What is the difference between vertical data format and horizontal data format in itemset mining?
  10. What is the difference between objective measures and subjective measures in pattern evaluation?

Answers

  1. The goal of pattern mining is to discover interesting, previously unknown patterns from large datasets.
  2. Frequent pattern mining focuses on finding frequently occurring patterns within a dataset, while sequential pattern mining focuses on finding frequently occurring patterns where the order of items matters.
  3. The Apriori algorithm is used for frequent itemset mining.
  4. Association rule mining focuses on finding relationships between items within a dataset, while correlation analysis focuses on finding relationships between variables within a dataset.
  5. The minimum support threshold in frequent pattern mining is the minimum number of times a pattern must occur within a dataset to be considered frequent.
  6. The minimum confidence threshold in association rule mining is the minimum conditional probability that the consequent of a rule appears in a transaction given that its antecedent appears in that transaction.
  7. Closed patterns are patterns that cannot be extended by adding more items without decreasing their support. Some related concepts are generator patterns and maximal patterns.
  8. Itemset mining focuses on finding sets of items that frequently occur together within a dataset, while sequence mining focuses on finding sequences of items that frequently occur together within a dataset.
  9. In itemset mining, the horizontal data format stores data as a list of transactions, where each transaction (record) contains a list of items, while the vertical data format stores data as a list of items, where each item is associated with the list of transactions in which it appears (its tid-list).
  10. Objective measures evaluate patterns based on statistical properties of the data, while subjective measures evaluate patterns based on their usefulness or interestingness to the user.
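To make answer 9 concrete, here is a small sketch that converts a database from the horizontal format to the vertical format (tid-lists); the transactions are a toy example:

```python
from collections import defaultdict

# Horizontal format: one record per transaction, listing its items
horizontal = {
    "T1": {"pasta", "lemon", "bread"},
    "T2": {"pasta", "lemon"},
    "T3": {"pasta", "cake"},
}

def to_vertical(horizontal):
    """Vertical format: one record per item, listing the transactions
    (the tid-list) in which that item appears."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)
print(sorted(vertical["pasta"]))  # ['T1', 'T2', 'T3']
# The support of an itemset is the size of the intersection of its tid-lists
print(len(vertical["pasta"] & vertical["lemon"]))  # 2
```

This tid-list intersection is the key operation behind vertical algorithms such as Eclat.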

Tell me how many questions you answered right in the comment section below!



Visualizing the item frequency distribution of pattern mining datasets

In this blog post, I will explain a quick and easy way of visualizing the frequency distribution of items in a dataset in SPMF format for pattern mining.

To do this, we will use a new online tool that I have put on my website for visualizing the frequency distribution of words in a text file. This tool can be used with any text file, but here I will demonstrate how it can be used for analyzing a pattern mining dataset.

Step 1: First, we will download the retail.txt dataset, which is a popular dataset used in frequent itemset mining. This dataset is available on the SPMF dataset webpage. But here is the direct link for downloading this dataset:
https://www.philippe-fournier-viger.com/spmf/datasets/retail.txt

Step 2: Then, we will open the online tool for analyzing the frequency distribution of items in a text file: https://www.philippe-fournier-viger.com/tools/draw_frequency_distribution.php

Step 3: Then, we will click on the following button and select the retail.txt file as input:

Step 4: Then, we could set various options but we will just keep the default options for now:

Step 5: Then, we will click the Draw button:

Step 6: This will display a chart showing the frequency distribution of items in this dataset. The X axis indicates the different frequency (support) values, while the Y axis indicates the corresponding number of distinct items that have a given frequency value. The result is like this:

As we can see from this picture, many items have a low frequency (support) and very few items have a high frequency. We can also see some statistics:

There are a total of 908576 item occurrences in the dataset (when counting duplicates), and only 16470 distinct items. We can also see the top 20 most frequent items:

Step 7: Now, if we want to better view the results displayed in the chart, we can change the maximum value for the X axis to 50, and then click Draw again:

The results now look like this:

Step 8: After we have done this, we can also use the following buttons to export this chart as a PNG image file, or to export the data used for this chart as a CSV file, or as a Latex PGFPlots file.

For example, if I export the data from the chart to Latex, the result is this:

Here, it does not look so good because there are too many labels on the X axis. It would then require further tuning of the Latex code to better display the labels on the X axis, which I will not do now.
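By the way, for readers who prefer scripting, the same kind of frequency distribution can be computed offline with a few lines of code. This sketch assumes a file in SPMF transaction format (one transaction per line, items separated by spaces):

```python
from collections import Counter

def frequency_distribution(path):
    """Read a dataset in SPMF transaction format and return (1) the
    support of each item and (2) for each support value, the number of
    distinct items having that support."""
    item_support = Counter()
    with open(path) as f:
        for line in f:
            # count each item at most once per transaction
            item_support.update(set(line.split()))
    distribution = Counter(item_support.values())
    return item_support, distribution

# Usage (assumes retail.txt has been downloaded locally):
# supports, dist = frequency_distribution("retail.txt")
# print(len(supports))             # number of distinct items
# print(supports.most_common(20))  # top 20 most frequent items
```

The distribution can then be plotted with any charting library to reproduce the long-tail shape shown by the online tool.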

Conclusion

This was just a short blog post to present an interesting new tool for analyzing the frequency distribution of items in a dataset in SPMF format, and more generally of words in any text file. It can be used for various types of datasets.



An Online Tool to Draw FP-Trees

This blog post introduces a new tool built with HTML5 and JavaScript for drawing FP-trees, which you can access here:
 https://www.philippe-fournier-viger.com/tools/draw_fptree.php

If you are not familiar with FP-trees: the FP-tree is a data structure used in the field of data mining by the classical FP-Growth algorithm to discover frequent itemsets. The FP-tree is a type of prefix tree (trie) structure, in which transactions are inserted.
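For readers curious about what happens under the hood, here is a minimal sketch of FP-tree construction (a simplified version, without the header table and node-links that the tool also draws):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fptree(transactions, minsup):
    """Build an FP-tree: keep frequent items only, sort each transaction
    by descending support, then insert it along a shared prefix path."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, s in support.items() if s >= minsup}
    root = FPNode(None)
    for t in transactions:
        # descending support; ties broken by item value for determinism
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

# The three example transactions used below with the online tool
tree = build_fptree([[1, 2, 3, 5], [2, 3, 4], [1, 4, 5]], minsup=1)
print(sorted(tree.children))  # [1, 2] : one branch per distinct prefix
```

Transactions that share a prefix after sorting share a path in the tree, which is what makes the structure compact.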

The new webpage allows drawing FP-trees by entering a set of transactions in a text area, choosing a minimum support threshold, and then clicking a button to draw the tree. There are also some other options to customize the result, such as checkboxes to decide whether or not to draw the header table and node-links of the FP-tree.

Now, let me show you the user interface of this tool:

The interface is very straightforward. To use this tool, we must first input a set of transactions using the SPMF format. Thus, we must write one transaction per line, where each transaction is a list of integers (items) separated by spaces. For example, let's say that I enter the following three transactions:

1 2 3 5
2 3 4
1 4 5

and set minsup = 1 transaction, and then if I click the “Draw FP-Tree” button, the result is:

The meaning is as follows:

This type of picture displays the header table, node-links, and FP-tree. The webpage also provides the option of only drawing the tree:

Or just drawing the tree and header table:

Let me now show you a more complex example, with another set of transactions:

1 2 3 5
2 3 4
1 4 5
5 6 7 8 9
1 2
2 3
3 4
4
5
6 7 8 9

If I set minsup = 1 transaction, the following FP-tree is drawn:

Then, if we raise the minsup threshold to 3 transactions, a smaller FP-tree is obtained:

And if we increase to minsup = 4 transactions, we obtain this FP-tree:

If you want to explore more possibilities, you can try it! And if you find some bugs, you may send me an e-mail to let me know. It is possible that there are still some bugs for special cases.

Conclusion

In this blog post, I have introduced a new tool for drawing FP-trees using HTML5 and JavaScript. I hope that it is interesting. Unlike some other tools for drawing FP-trees, this tool draws not only the tree but also the header table and node-links.

By the way, in another blog post, I have also shown how to draw FP-trees using Latex.



A yet more efficient high utility itemset mining algorithm (HAMM, to appear in IEEE TKDE)

Today, I will share the good news that I have participated as co-author in a paper proposing a new algorithm for high utility itemset mining called HAMM, which is very efficient and outperforms the state-of-the-art algorithms. This is not an easy feat, as many algorithms have been proposed for this problem over more than a decade. The algorithm will appear in an upcoming issue of the IEEE TKDE journal (one of the top journals in data mining).

What is especially interesting about this new HAMM algorithm is that it is inspired by FP-Growth but runs in a single phase, unlike previous FP-Growth-based algorithms for utility mining such as UP-Growth and IHUP. Besides, it is combined with the concept of utility vectors, and novel strategies are developed in HAMM.

Currently, the paper is accepted but not yet published in a regular issue of TKDE, so I will wait for the final version to be ready before sharing it.

In the meantime, stay tuned! Here is the citation for this new paper:

J. -F. Qu, P. Fournier-Viger, M. Liu, B. Hang and C. Hu, “Mining High Utility Itemsets Using Prefix Trees and Utility Vectors,” in IEEE Transactions on Knowledge and Data Engineering, doi: 10.1109/TKDE.2023.3256126.

And the webpage of TKDE where the paper is located: https://ieeexplore.ieee.org/document/10068302

The performance results are excellent compared to many of the most recent high utility itemset mining algorithms. Here are a few figures from the paper:



How to count the frequency of words and ngrams in a text document? – Two online tools

I have added two new webpages offering tools to count the frequency of each word and ngrams (consecutive sequences of words) in a text document. These webpages can be found here:

The Word Frequency Counter

First, let me show you the Word Frequency Counter so that you know what it can do.
The user interface is simple and user-friendly. It is shown below:

word frequency counter interface

To use the tool, we must first select a file to upload, and then select some options such as ignoring case and punctuation. We can also choose the type of output format, such as the CSV format, tab-separated format, and SPMF format. Then we must click the “calculate frequency” button to run the program.
For instance, let's say that I try with a small text file from my computer called TODO.txt. The results look like this:

We can see the words with their frequencies, sorted in descending order of frequency. In this case, I chose the default output format, but other output formats are offered:

If I choose CSV format, the results look like this:

If I choose the Tab-separated format, the results look like this:

And if I choose the SPMF format, the results look like this:

The Ngram Frequency Counter

Now let me show you the Ngram Frequency Counter. The user interface is very similar, except that we have an additional parameter which is the ngram size. For example, if we set the ngram size to 3, the tool will count the frequency of all sequences of 3 words (that are consecutive in a text document).

Let me show you an example, with the same example text file and an n-gram size of 3:

Now, let's say that I change the ngram size to 4. Here is the result:

As for the other tools, we can also change the output format.
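For readers who prefer scripting, the n-gram counting itself can be sketched in a few lines:

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Count every sequence of n consecutive words in the text."""
    words = text.lower().split()
    # zip over n staggered views of the word list to form the n-grams
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams)

text = "the cat sat on the mat and the cat slept"
freq = ngram_frequencies(text, 2)
print(freq[("the", "cat")])  # 2
```

Setting n = 1 degenerates to plain word frequency counting, which is what the Word Frequency Counter does.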

Conclusion

This was just a short blog post to show you two new tools offered on my website, namely the Word Frequency Counter and the Ngram Frequency Counter. These tools are simple and implemented in HTML5 + JavaScript.



Introduction to Rare Itemset Mining

In the last decades, many data mining algorithms were developed to find interesting patterns in data (pattern mining). In particular, many studies have focused on discovering frequent patterns (patterns that appear frequently) in data. A classic example of this type of problem is frequent itemset mining, which aims at finding sets of values that frequently appear together in a database. This data mining task has many applications. For instance, it can be used to find all the itemsets (sets of products) that customers frequently buy in a database of customer transactions.

Though discovering frequent patterns in data has many applications, there is an underlying assumption that patterns that frequently appear in data are important. But in real life, that is not always the case. For example, by analyzing data from a store, we may find that people very frequently buy bread and milk together, but this pattern may not be interesting because it is very common, and thus it is not surprising at all that customers buy these products together.

Thus, in some cases, it is interesting to search for patterns in data that are rare. Finding rare patterns has thus drawn the interest of many researchers. It has applications such as studying rare diseases, rare occurrences of network failures, and suspicious behavior.

In this blog post, I will thus talk about the problem of finding patterns in data that are infrequent (rare), which is known as the problem of rare itemset mining.

Some important definitions

To explain what rare itemset mining is, let me first introduce some definitions. To make it easy to understand, I will use the example of analyzing shopping data, but rare itemset mining is not limited to this application. Let us take for example the table below, which shows a database of customer transactions made in a store by different customers:

This database contains four transactions called T1, T2, T3 and T4, corresponding to four customers. The first transaction (T1) indicates that a customer has bought pasta, lemon, bread and orange. The second transaction (T2) indicates that a customer has bought pasta and lemon together. Transaction T3 indicates that a customer bought pasta, orange and cake, while transaction T4 indicates that someone bought pasta, lemon, orange with cake.
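This example database can be written down and queried in a few lines of code. The sketch below counts in how many transactions a given set of items appears together (this count is called the support, as explained in the rest of the post):

```python
def support(itemset, database):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in database)

# The four transactions T1-T4 described above
database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]
print(support({"lemon"}, database))          # 3
print(support({"pasta", "cake"}, database))  # 2
```

The `itemset <= t` test is Python's subset check on sets, so each transaction contributes 1 to the sum exactly when it contains the whole itemset.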

If we have such data, we can apply pattern mining algorithms to discover interesting patterns in it. Several algorithms are available for this. The traditional task is frequent itemset mining, that is, finding all the sets of items (itemsets) that appear at least minsup times in a database, where minsup is a threshold set by the user. For example, if the user sets minsup = 2, the goal of frequent itemset mining is to find all sets of items that have been purchased at least twice by customers. For instance, in the above database, the frequent itemsets are the following: {pasta}, {lemon}, {orange}, {cake}, {pasta, lemon}, {pasta, orange}, {pasta, cake}, {lemon, orange}, {orange, cake}, {pasta, lemon, orange} and {pasta, orange, cake}.

For instance, {lemon} is said to be a frequent itemset because it appears at least twice in the database above. In fact, lemon actually appears three times (in transactions T1, T2 and T4). Another example is the itemset {pasta, cake}, which is said to be frequent because it appears in two transactions (T3 and T4). Generally, if an itemset appears in X transactions, we say that it has a support of X. For instance, the itemset {lemon} has a support of 3 and the itemset {pasta, cake} has a support of 2. If we want to visualize the frequent itemsets and all the possible itemsets, we can draw a picture called a Hasse diagram, shown below:
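To make the notion of support concrete, here is a small Python sketch (my own illustration, not taken from any library) that counts the support of an itemset in the example database:

```python
# The example transaction database from this post.
database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]

def support(itemset, database):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for transaction in database if itemset <= transaction)

print(support({"lemon"}, database))          # 3 (T1, T2, T4)
print(support({"pasta", "cake"}, database))  # 2 (T3, T4)
```

The subset test `itemset <= transaction` is all that is needed: an itemset is supported by a transaction exactly when the transaction contains all of its items.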

In this picture, we have all the possible itemsets (combinations of items) that customers could buy together. To make the picture easier to read, we use the letters l, p, b, o, c to represent the items lemon, pasta, bread, orange, and cake, respectively. So, for example, the notation “poc” means to buy pasta, orange and cake together. In the picture, the frequent itemsets are colored in yellow, while the infrequent itemsets are colored in white. There is an edge between two itemsets if one is included in the other and the other has one more item. For example, there is an edge from “po” to “poc” because “po” is included in “poc” and “poc” has one more item. In total, if we have five items as in this case, there are 2^5 = 32 possible itemsets, as depicted in this picture. The itemset at the top of the picture is the empty set, while the one at the bottom is “lpboc”, which means to buy all five items together.

The problem of frequent itemset mining thus aims at finding values that appear frequently together (the yellow itemsets). But what if we want to find the itemsets that are rare instead of frequent? To do this, we first need to define what we mean by a rare itemset. There are a few definitions. I will give a brief overview.

Definition 1: the infrequent itemsets

The most intuitive definition of a rare itemset is an itemset that is not frequent. Thus, we can say that a rare itemset is an itemset that has a support less than the minsup threshold set by the user. Formally, an itemset X is infrequent if sup(X) < minsup, where sup(X) represents the support of X.

For example, if minsup = 2, the infrequent itemsets are illustrated in orange color in the picture below:

All these itemsets are infrequent because their support is less than 2. For example, the itemset “bo” (bread and orange) has a support of 1, while “bc” (bread and cake) has a support of 0.

Is this a good definition? No, because there are just too many infrequent itemsets. In fact, it can be observed that many of them never appear in the database. For example, “pboc” (pasta, bread, orange and cake) is an infrequent pattern, but it has a support of zero because no customer has bought these items together. It is obviously uninteresting to find such patterns that have a support of zero. Itemsets having a support of zero are marked with an X in the picture below:
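This explosion is easy to check by brute force. The following sketch (again my own illustration) enumerates all 31 non-empty itemsets of the example and counts how many are infrequent for minsup = 2, and how many of those never appear at all:

```python
from itertools import combinations

# The example database from this post.
database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]
items = sorted(set().union(*database))
minsup = 2

def support(itemset):
    return sum(1 for transaction in database if itemset <= transaction)

# Enumerate every non-empty itemset over the five items (2^5 - 1 = 31 of them).
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
infrequent = [s for s in all_itemsets if support(s) < minsup]
zero_support = [s for s in infrequent if support(s) == 0]
print(len(all_itemsets), len(infrequent), len(zero_support))  # 31 20 8
```

So 20 of the 31 itemsets are infrequent, and 8 of those have a support of zero, which illustrates why Definition 1 returns far too many uninteresting patterns.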

Thus, we may try to find a better definition of what is a rare itemset so that we will not find too many rare itemsets.

Definition 2: minimal rare itemsets

An alternative definition of a rare itemset was proposed. An itemset X is said to be a minimal rare itemset if sup(X) < minsup and all (proper) subsets of X are frequent itemsets. To make this easier to understand, I have illustrated the minimal rare itemsets in the picture below for minsup = 2 on the same example.


Here we can see that there are only two minimal rare itemsets, which are “lc” and “b”. “lc” is a minimal rare itemset because its support is less than 2 and all its subsets (“l” and “c”) are frequent. And “b” is a minimal rare itemset because its support is less than 2 and all its subsets are frequent (in this case, the only subset of “b” is the empty set, which is always frequent by definition).

Thus, if we find only the minimal rare itemsets, we may find a small set of itemsets that are almost frequent.
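Definition 2 can be checked by brute force on a small example. In this sketch (my own code, not the AprioriRare algorithm itself), an itemset is kept if it is infrequent while every subset obtained by removing one item is frequent; since the support is anti-monotone, this is enough to guarantee that all proper subsets are frequent:

```python
from itertools import combinations

database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]
items = sorted(set().union(*database))
minsup = 2

def support(itemset):
    return sum(1 for transaction in database if itemset <= transaction)

# An itemset is minimal rare if it is infrequent, but removing any single
# item yields a frequent subset. Note that support(frozenset()) = 4, so the
# empty set is frequent, which handles single items like {bread} correctly.
minimal_rare = [s for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                for s in [frozenset(c)]
                if support(s) < minsup
                and all(support(s - {i}) >= minsup for i in s)]
print([sorted(s) for s in minimal_rare])  # [['bread'], ['cake', 'lemon']]
```

The output matches the two minimal rare itemsets “b” and “lc” found in the picture above.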

The first algorithm that was proposed to find minimal rare itemsets is AprioriRare, by Szathmary et al. (2007), described in this paper:

Laszlo Szathmary, Amedeo Napoli, Petko Valtchev: Towards Rare Itemset Mining. ICTAI (1) 2007: 305-312

An efficient open-source implementation of AprioriRare can be found in the SPMF software.

Definition 3: perfectly rare itemsets

Another definition of a rare itemset was proposed by Koh et al. (PAKDD 2005). They define the concept of a perfectly rare itemset using two thresholds, minsup and maxsup: the aim is to find all itemsets that have a support no less than minsup and that contain only items with a support less than maxsup. Because the support is anti-monotone, the second condition implies that the itemset itself, and all of its subsets, have a support less than maxsup.
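This definition is also easy to check by brute force. Here is a sketch (my own illustration, not the AprioriInverse algorithm) applied to the running example with minsup = 1 and maxsup = 1.9; since supports are integers, maxsup = 1.9 means that only items with a support of at most 1 may appear:

```python
from itertools import combinations

database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]
items = sorted(set().union(*database))
minsup, maxsup = 1, 1.9

def support(itemset):
    return sum(1 for transaction in database if itemset <= transaction)

# Only items that are themselves rare (support < maxsup) may appear in a
# perfectly rare itemset; by anti-monotonicity this bounds all subsets too.
rare_items = {i for i in items if support({i}) < maxsup}
perfectly_rare = [frozenset(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)
                  if set(c) <= rare_items
                  and minsup <= support(frozenset(c)) < maxsup]
print([sorted(s) for s in perfectly_rare])  # [['bread']]
```

On this small example, only bread has a support of at most 1, so {bread} is the only perfectly rare itemset for these parameters.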

To show an example of the type of results that this provides, here are the perfectly rare itemsets for minsup = 1 and maxsup = 1.9 on the same example:

And if we change the parameters a little bit, here is the result:

The first algorithm for mining perfectly rare itemsets is AprioriInverse, proposed by Koh et al. in this paper:

Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using AprioriInverse. PAKDD 2005: 97-106

An open-source implementation of AprioriInverse can be found in the SPMF software.

Other extensions

There also exist other extensions and variations of the problem of rare itemset mining besides what I have covered in this blog post. My aim was to give a brief overview.

Software for rare itemset mining

If you want to try rare itemset mining, you can check the SPMF software, as mentioned above. It provides over 250 algorithms for various pattern mining tasks, including several algorithms for rare pattern mining. SPMF is written in Java but can be called from various other languages, and can also be used as a standalone application.

A video lecture on rare itemset mining

If you want to know more about the topic of rare itemset mining, you can also watch my video lecture on this topic, which is part of my free online course about pattern mining.

Conclusion

In this blog post, I have briefly explained what rare itemset mining is, that is, how to find infrequent patterns in data. I mainly explained the different problem definitions but did not go into details about how the algorithms work, to keep the blog post short. I hope you have enjoyed this. If you have any questions or feedback, please leave a comment below. Thank you for reading!

By the way, if you are new to itemset mining, you might be interested to check these two survey papers that give a good introduction to the topic:

  • Fournier-Viger, P., Lin, J. C.-W., Vo, B., Chi, T. T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207. DOI: 10.1002/widm.1207
  • Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: a 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329. DOI: 10.1002/widm.1329


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.


Free Pattern Mining Course

Hi all, this is to let you know that I have developed a free pattern mining course that you can watch online. The course is designed to introduce students and researchers to the different topics of pattern mining and to explain the key algorithms and concepts.

The course can be watched online as video lectures and comes with PPT slides and exercises. You can also use the SPMF software to try all the algorithms that are discussed in the course and see their source code.

The course is provided for free to share education on this topic with everyone 😉 Note, however, that this course is in BETA, because I will keep adding more content to cover more topics in the coming months. Currently, the topics already covered are the following:

Hope you will enjoy this free course. If you have any feedback for improvement, you can send me an e-mail or leave a comment at the bottom of this page. I will be pleased to read your comments.

More videos on pattern mining

By the way, if you want to see more videos on pattern mining, you may also check:
– The video page on the SPMF website: SPMF: A Java Open-Source Data Mining Library (philippe-fournier-viger.com)
– My Youtube channel: https://www.youtube.com/channel/UCk26EiKTBxk1NAQniOV_oyQ/




What is a maximal itemset?

In data mining, an itemset is a collection of items that occur together in a transactional dataset. For example, if we have a dataset of customer purchases, an itemset could be {bread, cake, orange}, meaning that these three items were bought together by several customers.

A frequent itemset is an itemset that occurs at least a certain number of times (or percentage) in the dataset. This number or percentage is called the minimum support threshold and it is usually specified by the user (but could be set automatically). For example, if we set the minimum support threshold to 3, then {bread, milk, eggs} is a frequent itemset if it occurs in 3 or more transactions in the dataset.

A maximal frequent itemset is a frequent itemset for which none of its immediate supersets are frequent. An immediate superset of an itemset is an itemset that contains one more item than the original itemset. For example, {bread, cake, orange, pasta} is an immediate superset of {bread, cake, orange}. A maximal frequent itemset is thus a frequent itemset that cannot be extended with any other item without becoming infrequent.

To illustrate this concept, consider the following example:

Transaction ID	Items
1	A B C D
2	A B D
3	A C D
4	B C D

Assume that the minimum support threshold is 50%, meaning that an itemset must occur in at least 2 transactions to be frequent (since this dataset has four transactions). The following table shows all the possible itemsets and their support counts (the number of transactions in which they occur):

Itemset	Support count (number of transactions in which the itemset appears)
A	3
B	3
C	3
D	4
AB	2
AC	2
AD	3
BC	2
BD	3
CD	3
ABC	1
ABD	2
ACD	2
BCD	2
ABCD	1

Based on the minimum support threshold, the frequent itemsets are A, B, C, D, AB, AC, AD, BC, BD, CD, ABD, ACD, and BCD. Out of these 13 frequent itemsets, only three are maximal frequent: ABD, ACD and BCD. This is because their only immediate superset (ABCD) is not frequent. The remaining frequent itemsets are not maximal frequent because they each have at least one immediate superset that is frequent. For example, B is not maximal frequent because AB, BC and BD are all frequent.
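The table above can be reproduced and the maximal itemsets extracted with a short brute-force Python sketch (an illustration of the definition, not one of the specialized algorithms mentioned later in this post):

```python
from itertools import combinations

# The four transactions from the example above.
database = [set("ABCD"), set("ABD"), set("ACD"), set("BCD")]
items = sorted(set().union(*database))
minsup = 2  # 50% of 4 transactions

def support(itemset):
    return sum(1 for transaction in database if itemset <= transaction)

# All frequent itemsets (support >= minsup), by exhaustive enumeration.
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsup]

# A frequent itemset is maximal if adding any missing item makes it infrequent.
maximal = [f for f in frequent
           if all(support(f | {i}) < minsup for i in items if i not in f)]
print(sorted("".join(sorted(f)) for f in maximal))  # ['ABD', 'ACD', 'BCD']
```

This confirms the 13 frequent itemsets of the table and the three maximal ones, ABD, ACD and BCD.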

Why are maximal frequent itemsets useful?

Maximal frequent itemsets provide a compact representation of all the frequent itemsets for a given dataset and minimum support threshold. This is because every frequent itemset is a subset of some maximal frequent itemset. For example, in the above example, all the frequent itemsets are subsets of either ABD, ACD or BCD. Therefore, by knowing ABD, ACD and BCD and their support counts, we can derive all the other frequent itemsets, although we may not know their exact support counts.

It is useful to discover maximal frequent itemsets because they give a summary of all frequent itemsets to the user. Also, discovering maximal frequent itemsets can be faster and require less memory and storage space than finding all the frequent itemsets. To give an example of how compact maximal itemsets can be, consider a real benchmark dataset called Chess. If we set minsup = 40% on Chess, we get about 6 million frequent itemsets but only 38,050 maximal frequent itemsets.

Note that besides maximal itemsets there also exists other compact representations of frequent itemsets such as closed itemsets and generator itemsets.

How to find maximal frequent itemsets?

There are several algorithms for finding maximal frequent itemsets in a transactional dataset. They are generally variations of popular frequent itemset mining algorithms such as FP-Growth, Eclat and Apriori.

One of the most efficient algorithms for maximal itemset mining is FPMax, which is based on the FP-Growth algorithm. It uses a compact data structure called an FP-tree. FPMax builds an FP-tree from the dataset and then mines it recursively by generating conditional FP-trees for each item. It uses several pruning strategies to avoid generating non-maximal frequent itemsets and to reduce the size of the conditional FP-trees.

Some other notable algorithms for maximal itemset mining are MAFIA, LCM_MAX and GenMax.

Source code and datasets

If you want to try discovering frequent itemsets, closed itemsets, maximal itemsets and generator itemsets, you can check the SPMF data mining software, which is implemented in Java and can be called from other popular programming languages as well. It is highly efficient and open-source, with many examples and datasets.

Video

If you are interested in learning more about this topic, you can also watch a 50-minute video lecture in which I explain in more detail what maximal itemsets, closed itemsets and generator itemsets are. This video is part of my free pattern mining course, and you can watch it from my website here: Maximal, Closed and Generator itemsets, or from my YouTube channel.

Conclusion

In this blog post, I have briefly explained what a maximal itemset is, why it is useful, and presented some algorithms for this problem. I hope that you have enjoyed this post!

By the way, if you are new to itemset mining, you might also be interested to check these two survey papers that give a good introduction to the topic:

  • Fournier-Viger, P., Lin, J. C.-W., Vo, B., Chi, T. T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207. DOI: 10.1002/widm.1207
  • Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: a 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329. DOI: 10.1002/widm.1329


