Some shortcomings of CSRankings

CSRankings is a popular website that provides a ranking of computer science departments around the world. The website can be found at https://csrankings.org/. In this blog post, I will talk about this ranking and some of its shortcomings. Of course, no ranking is perfect, and what I write below is just my personal opinion.

What is CSRankings?

First, it needs to be said that there exist many rankings to evaluate computer science departments, and they use various criteria based on teaching, research, or a combination of both. CSRankings is purely focused on research. It evaluates a department based on its research output in terms of articles in the very top-level conferences. A good thing about this ranking is that the ranking algorithm is completely transparent and well explained: (1) it uses public data, and (2) the code that assigns a score to each department is explained and is also open-source.

The ranking looks like this:

Shortcomings of CSRankings

Now, let’s talk about what I see as the main shortcomings of CSRankings:

1) The ranking ignores journal papers and focuses only on conference papers, but in several countries, journals are deemed more important than conference publications. Thus, there is a bias there.

2) It is a US-centric ranking. As explained in the FAQ of CSRankings, a conference is only included in this ranking if at least 50 R1 US universities have published in it during the last 10 years.

3) Some sub-fields of computer science are not well-represented, and some conferences appear to be easier to publish in than others. For example, I am a data mining researcher, and KDD is arguably the top conference in my field. KDD is highly competitive, with thousands of submissions and generally an acceptance rate of around 10-20%, but it is deactivated by default in CSRankings:

But I also notice that most other top data mining conferences, such as ICDM and CIKM, are not included either. ICDE is another data mining related conference, with an acceptance rate of about 19%. It is there, but it is also deactivated by default:

I find this quite surprising because, for other fields, some conferences that are arguably easier to publish in than ICDE and KDD are included in the ranking. For example, ICDE and KDD typically have acceptance rates in the range of 10-20%, while for robotics, the IROS and ICRA conferences are included in CSRankings even though they have much higher acceptance rates. IROS has an acceptance rate of around 49% and ICRA of around 43%, as can be seen below (source: https://staff.aist.go.jp/k.koide/acceptance-rate.html ):

Thus, it seems to me that the ranking treats different fields unequally: some fields have conferences included in the ranking that are much easier to publish in than those of other fields. I think this problem emerges from the design decision of CSRankings to include only about three conferences for each research area and to define the research areas based on ACM Special Interest Groups.

4) It is a conservative ranking that focuses on big conferences in popular research areas. It does not encourage researchers to publish in new conferences but rather to focus on well-established big conferences, as everything else does not count. It also does not encourage publishing in smaller conferences that might be more relevant. For example, while doing my PhD, I was working on intelligent tutoring systems, and the two top conferences in that field are Intelligent Tutoring Systems (ITS) and Artificial Intelligence in Education (AIED). These conferences are rather small and specific, so they are totally ignored by CSRankings. But those are the conferences that matter in that field.

5) By design, the ranking focuses only on research. But some other important aspects like teaching may be relevant to some people. For example, an undergraduate student may be interested in other aspects, such as how likely they are to find a job after graduating. In that case, other rankings should be used.

Conclusion

That was just a quick blog post to point out what I see as some shortcomings of CSRankings. Of course, no ranking is perfect, and CSRankings still provides some useful information. But in my opinion, it has some limitations, and it seems to me that not all fields are equal in this ranking.

What do you think? Post your opinion in the comment section below.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.


An Online Demo of the Eclat Algorithm

I have created a new interactive webpage to demonstrate how the Eclat algorithm is applied for frequent itemset mining. This webpage allows you to enter a transaction database, select the minimum support, and see step by step what the Eclat algorithm does to produce the final result. This tool is designed for students and anyone who wants to learn how Eclat works.

The webpage is here: Eclat Algorithm Demo

Let me show you how it works. First you have to enter a transaction database such as this:

Then, you can select a minimum support threshold value such as 2 transactions and click the Run Eclat button:

Then, all the steps of the algorithm will be displayed as well as the final result like this:

## Step 1: Convert the Transaction Data to the Vertical Format

| Item | Transaction IDs |
|--------|--------------|
| bread | {0, 1, 4} |
| milk | {0, 3, 4} |
| cheese | {1, 2, 3, 4} |
| butter | {1, 3} |
| eggs | {2, 3} |


## Step 2: Calculate the support of each item 

| Item | Transaction IDs | Support |
|--------|--------------|---------|
| bread | {0, 1, 4} | 3 |
| milk | {0, 3, 4} | 3 |
| cheese | {1, 2, 3, 4} | 4 |
| butter | {1, 3} | 2 |
| eggs | {2, 3} | 2 |


## Step 3: Keep only the frequent items 

| Item | Transaction IDs | Support |
|--------|--------------|---------|
| bread | {0, 1, 4} | 3 |
| milk | {0, 3, 4} | 3 |
| cheese | {1, 2, 3, 4} | 4 |
| butter | {1, 3} | 2 |
| eggs | {2, 3} | 2 |


## Step 4: Sort The Frequent Items by Ascending Order of Support

| Item | Support |
|--------|---------|
| butter | 2 |
| eggs | 2 |
| bread | 3 |
| milk | 3 |
| cheese | 4 |

## Step 5: Start the Depth-First Search to Generate Candidates and Find the Frequent Itemsets

##### The algorithm now checks the equivalence class containing these itemsets: 
  equivalence class:  {butter}  {eggs}  {bread}  {milk}  {cheese} 

Joining candidates {butter} and {eggs} to create {butter eggs}:

Transactions of {butter}: {1, 3}
Transactions of {eggs}: {2, 3}
Transactions of {butter eggs}: {1, 3} ∩ {2, 3} = {3}
Support of {butter eggs}: 1
Frequent: No

Joining candidates {butter} and {bread} to create {butter bread}:

Transactions of {butter}: {1, 3}
Transactions of {bread}: {0, 1, 4}
Transactions of {butter bread}: {1, 3} ∩ {0, 1, 4} = {1}
Support of {butter bread}: 1
Frequent: No

Joining candidates {butter} and {milk} to create {butter milk}:

Transactions of {butter}: {1, 3}
Transactions of {milk}: {0, 3, 4}
Transactions of {butter milk}: {1, 3} ∩ {0, 3, 4} = {3}
Support of {butter milk}: 1
Frequent: No

Joining candidates {butter} and {cheese} to create {butter cheese}:

Transactions of {butter}: {1, 3}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {butter cheese}: {1, 3} ∩ {1, 2, 3, 4} = {1, 3}
Support of {butter cheese}: 2
Frequent: Yes

Joining candidates {eggs} and {bread} to create {eggs bread}:

Transactions of {eggs}: {2, 3}
Transactions of {bread}: {0, 1, 4}
Transactions of {eggs bread}: {2, 3} ∩ {0, 1, 4} = {}
Support of {eggs bread}: 0
Frequent: No

Joining candidates {eggs} and {milk} to create {eggs milk}:

Transactions of {eggs}: {2, 3}
Transactions of {milk}: {0, 3, 4}
Transactions of {eggs milk}: {2, 3} ∩ {0, 3, 4} = {3}
Support of {eggs milk}: 1
Frequent: No

Joining candidates {eggs} and {cheese} to create {eggs cheese}:

Transactions of {eggs}: {2, 3}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {eggs cheese}: {2, 3} ∩ {1, 2, 3, 4} = {2, 3}
Support of {eggs cheese}: 2
Frequent: Yes

Joining candidates {bread} and {milk} to create {bread milk}:

Transactions of {bread}: {0, 1, 4}
Transactions of {milk}: {0, 3, 4}
Transactions of {bread milk}: {0, 1, 4} ∩ {0, 3, 4} = {0, 4}
Support of {bread milk}: 2
Frequent: Yes

Joining candidates {bread} and {cheese} to create {bread cheese}:

Transactions of {bread}: {0, 1, 4}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {bread cheese}: {0, 1, 4} ∩ {1, 2, 3, 4} = {1, 4}
Support of {bread cheese}: 2
Frequent: Yes

##### The algorithm now checks the equivalence class containing these itemsets: 
  equivalence class:  {bread,milk}  {bread,cheese} 

Joining candidates {bread milk} and {bread cheese} to create {bread milk cheese}:

Transactions of {bread milk}: {0, 4}
Transactions of {bread cheese}: {1, 4}
Transactions of {bread milk cheese}: {0, 4} ∩ {1, 4} = {4}
Support of {bread milk cheese}: 1
Frequent: No

Joining candidates {milk} and {cheese} to create {milk cheese}:

Transactions of {milk}: {0, 3, 4}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {milk cheese}: {0, 3, 4} ∩ {1, 2, 3, 4} = {3, 4}
Support of {milk cheese}: 2
Frequent: Yes

The final result is:

butter
eggs
bread
milk
cheese
butter cheese
eggs cheese
bread milk
bread cheese
milk cheese
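For learners who prefer code to traces, the whole procedure above (vertical format conversion, support counting, and the depth-first search with tid-list intersections) can be sketched in a few lines of Python. This is just an illustrative sketch, not the JavaScript code of the demo; the transaction database below is the one implied by the tid-lists shown in Step 1.

```python
def eclat(transactions, minsup):
    # Step 1: convert to the vertical format (item -> set of transaction ids)
    vertical = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            vertical.setdefault(item, set()).add(tid)

    # Steps 2-4: keep the frequent items, sorted by ascending support
    items = sorted((i for i in vertical if len(vertical[i]) >= minsup),
                   key=lambda i: (len(vertical[i]), i))

    # Step 5: depth-first search; joining two itemsets of an equivalence
    # class intersects their tid-lists to get the tid-list of the union
    frequent = {}
    def search(equivalence_class):
        for pos, (itemset, tids) in enumerate(equivalence_class):
            frequent[itemset] = len(tids)
            new_class = []
            for other, other_tids in equivalence_class[pos + 1:]:
                joined_tids = tids & other_tids  # tid-list intersection
                if len(joined_tids) >= minsup:
                    new_class.append((itemset + (other[-1],), joined_tids))
            if new_class:
                search(new_class)

    search([((item,), vertical[item]) for item in items])
    return frequent

# The database implied by the tid-lists of Step 1 (transactions 0 to 4)
database = [{"bread", "milk"},
            {"bread", "cheese", "butter"},
            {"cheese", "eggs"},
            {"milk", "cheese", "butter", "eggs"},
            {"bread", "milk", "cheese"}]
result = eclat(database, minsup=2)
```

Running this with a minimum support of 2 yields the same ten frequent itemsets as the trace above, for example {bread, milk} with support 2 and {cheese} with support 4.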

This is perfect for learning, as you can easily experiment with different inputs and see the results in your browser. However, this implementation of Eclat in JavaScript is not designed to be efficient. For efficient implementations of frequent itemset mining algorithms, please see the SPMF software. It offers highly efficient implementations that can run on large databases.

By the way, I also created another webpage that gives an interactive demo of the Apriori algorithm, another popular algorithm for frequent itemset mining. And if you want to learn more about pattern mining, you may also be interested in checking my free online course on pattern mining.




Test your knowledge of sequential rule mining!

Do you know about sequential rule mining? It is a popular task in pattern mining that aims at finding rules in sequences. In this blog post, I will give a list of 8 questions and answers to evaluate your knowledge about sequential rule mining.

If you don’t know about sequential rule mining, you may want to read my blog post “An Introduction to Sequential Rule Mining”, which provides a brief introduction to this topic.

The questions are presented next. The answers are at the end of the blog post.

Questions

Question 1: What is a sequence database?

A) A database that contains sequences of items
B) A database that contains sequences of events
C) A database that contains sequences of itemsets
D) A database that contains sequences of strings

Question 2: What is a sequential pattern?

A) A subsequence that appears in several sequences of a database
B) A rule that predicts the next itemset in a sequence
C) A sequence that has a high support and confidence
D) A sequence that has a high frequency and probability

Question 3: What is a sequential rule?

A) A rule that predicts items that will appear after some other items in a sequence
B) A rule that predicts the next sequence in a database
C) A rule that predicts the next item in an itemset
D) A rule that predicts the next event in an event sequence

Question 4: What are some applications of sequential rule mining?

A) Analyzing customer behavior in supermarkets or online shops
B) Recommending products or services to customers based on their previous purchases
C) Optimizing marketing strategies or promotions based on customer preferences
D) All of the above

Question 5: What are some algorithms for sequential rule mining?

A) GSP, SPADE, SPAM, PrefixSpan
B) RuleGrowth, ERMiner, CMRules
C) Apriori, FP-Growth, Eclat
D) A and B

Question 6: What is the difference between a left expansion and a right expansion of a rule? (choose multiple answers as needed)

A) A left expansion adds an item to the left-hand side of a rule
B) A right expansion adds an item to the right-hand side of a rule
C) A left expansion consists of scanning items before a rule
D) A right expansion consists of scanning items after a rule

Question 7: What are some advantages of sequential rule mining compared to sequential pattern mining?

A) Sequential rule mining can capture more information about the probability of a pattern being followed, which is useful for decision-making and prediction
B) Sequential rule mining can generate more rules than patterns, which may increase the complexity and redundancy
C) Sequential rule mining is easier to understand than sequential pattern mining and also always faster
D) All of the above

Question 8: What are some factors that affect the quality and quantity of sequential rules?

A) The density of the sequence database
B) The support and confidence thresholds set by the user
C) The number of different items and the average length of sequences
D) All of the above

Answers

Answer to question 1

C: A sequence database is a database that contains sequences of itemsets. An itemset is a set of items that occur together in a sequence. For example, consider the following sequence database:

| Sequence ID | Sequence |
|-------------|----------|
| seq1 | <{a}{c}{e}> |
| seq2 | <{a,d}{c}{b}{a,b,e,f}> |
| seq3 | <{d}{a,b,c}{e,f,g}> |
| seq4 | <{a,b}{d,e,f,g,h}> |
This database contains four sequences of itemsets. Each itemset is enclosed in curly braces, and the items within an itemset are separated by commas. For example, the first sequence contains three itemsets: {a}, {c}, and {e}.

Answer to question 2

A: A sequential pattern is a subsequence that appears in several sequences of a database. For example, the sequential pattern <{a}{c}{e}> appears in the first and second sequences of the previous database. This pattern indicates that customers who bought {a} often bought {c} afterward, followed by {e}. The support of a sequential pattern is the number or percentage of sequences that contain it. For example, the support of <{a}{c}{e}> is 2 (or 50%) in the previous database.
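To make the notion of a pattern "appearing" in a sequence concrete, here is a small illustrative Python sketch (the function names are hypothetical): each itemset of the pattern must be a subset of a later itemset of the sequence, in order.

```python
def contains(sequence, pattern):
    # Greedy left-to-right matching: advance in the pattern each time the
    # current pattern itemset is a subset of the next sequence itemset
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(database, pattern):
    # The support is the number of sequences that contain the pattern
    return sum(contains(s, pattern) for s in database)

# The example sequence database from the answer to question 1
database = [
    [{"a"}, {"c"}, {"e"}],
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    [{"d"}, {"a", "b", "c"}, {"e", "f", "g"}],
    [{"a", "b"}, {"d", "e", "f", "g", "h"}],
]
```

For example, support(database, [{"a"}, {"c"}, {"e"}]) returns 2, matching the 50% support stated above.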

Answer to question 3

A: A sequential rule predicts items that will appear after some other items in a sequence. It can have the form X -> Y, where X and Y are itemsets or sequential patterns. For example, the sequential rule {a} -> {c} means that if a customer buys {a}, then they are likely to buy {c} afterward. The confidence of a sequential rule is the conditional probability that Y will follow X in a sequence. For example, the confidence of {a} -> {e} is 100% in the previous database, because every time {a} appears, it is followed by {e}. The support of a sequential rule is the number or percentage of sequences in which X appears followed by Y. For example, the support of {a} -> {c} is 2 (or 50%) in the previous database.
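These two measures can be computed as sketched below. This is an illustrative Python sketch with hypothetical function names; here, as in algorithms such as RuleGrowth, a rule X -> Y occurs in a sequence when some cut point splits it so that all items of X appear before the cut and all items of Y appear after it.

```python
def rule_occurs(sequence, X, Y):
    # X -> Y occurs if some cut point splits the sequence so that all of X
    # appears strictly before the cut and all of Y appears at or after it
    for cut in range(1, len(sequence)):
        before = set().union(*sequence[:cut])
        after = set().union(*sequence[cut:])
        if X <= before and Y <= after:
            return True
    return False

def support_and_confidence(database, X, Y):
    rule_support = sum(rule_occurs(s, X, Y) for s in database)
    x_support = sum(X <= set().union(*s) for s in database)
    # Confidence: among sequences containing X, the fraction where Y follows X
    return rule_support, rule_support / x_support

# The example sequence database from the answer to question 1
database = [
    [{"a"}, {"c"}, {"e"}],
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    [{"d"}, {"a", "b", "c"}, {"e", "f", "g"}],
    [{"a", "b"}, {"d", "e", "f", "g", "h"}],
]
```

On this database, support_and_confidence(database, {"a"}, {"e"}) gives (4, 1.0), i.e. support 4 and confidence 100%, while {a} -> {c} has support 2 and confidence 50%, matching the numbers above.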

Answer to question 4

D: all of the above. It also has many other applications, such as malware detection and genome sequence analysis.

Answer to question 5

D: There are many algorithms for sequential rule mining that have been proposed in the literature. Some of them are based on sequential pattern mining algorithms, while others are designed specifically for sequential rule mining. Some examples are:

  • GSP, SPADE, SPAM, PrefixSpan, CM-SPAM, CM-SPADE: These are classic sequential pattern mining algorithms that can be extended to generate sequential rules by computing the confidence of each pattern.
  • RuleGrowth, ERMiner, CMRules: These are sequential rule mining algorithms that directly mine rules without generating patterns first. They use different strategies to identify rules efficiently.

Answer to question 6

A,B: A left expansion of a rule X -> Y adds an item or itemset Z to the left-hand side of the rule. For example, a left expansion of {a} -> {c} could be {a,b} -> {c}. A right expansion of a rule X -> Y adds an item or itemset Z to the right-hand side of a rule. For example, a right expansion of {a} -> {c} could be {a} -> {c, e}. Left and right expansions are used by some sequential rule mining algorithms such as RuleGrowth and ERMiner to grow rules incrementally.

Answer to question 7

A: Sequential rule mining can capture more information about the probability of a pattern being followed by another one, which can be useful for predicting future behavior or events.

Answer to question 8

D: all of the above

Conclusion

I hope you have enjoyed this little quiz about sequential rule mining. How many good answers did you get? Let me know in the comment section below.




Having a good posture for working at the computer is important!

Today, I am writing a short blog post. I will not talk about data mining; instead, I want to remind all of you, my readers, about the importance of having a good posture when working at the computer. This is especially important if you have to work for very long periods of time, like I do. Having a good posture and doing some exercise is important to stay healthy, especially as we get older. You may not see any problems with sitting at a computer 12 hours a day when you are in your 20s, but problems may start to appear in your 30s or 40s. Thus, today, I want to talk a little bit about this and show you a picture of how I have arranged my desk for work these days:

As you can guess, I work in a standing position, because I started to feel uncomfortable sitting the whole day, and I have a lot of work to do. Thus, I now alternate between sitting and standing positions. It is something that you may also want to consider if you start to feel tired or have problems with your back or legs.

Another quick recommendation is to avoid working on a laptop when possible. Working on a laptop generally results in a bad posture, as the screen is too low. A solution is to plug the laptop into an external screen and put that screen higher, or to use a wireless keyboard and raise the laptop screen. This is another way to improve your posture while working at the computer. Also, you may consider getting a better chair or a table with adjustable height.

Finally, doing some sport is a good way to stay healthy. For example, I like to go running outside. But any sport is good and will improve your health.

This is just a short blog post to remind all my readers about this! 🙂


Test your knowledge about periodic pattern mining

Periodic pattern mining is a data mining technique used to discover periodic patterns in a sequence of events. Algorithms for periodic pattern mining have many applications. I have prepared 10 questions to evaluate your knowledge of periodic pattern mining. You may answer them and then check the answers at the end of this post. Then, you may let me know how many good answers you got in the comment section. 😉

If you don’t know about this topic, you can check my introduction to periodic pattern mining and also my list of key papers on periodic pattern mining. And you may also find source code of fast implementations in the SPMF software.

Questions

  1. What is the main goal of periodic pattern mining?
  2. What is the difference between periodic pattern mining and sequential pattern mining?
  3. What are some common applications of periodic pattern mining?
  4. What is the minimum support threshold in periodic pattern mining?
  5. What is the period length in periodic pattern mining?
  6. What is the PFPM algorithm in periodic pattern mining?
  7. What is a stable periodic pattern?
  8. What is the difference between exact and approximate periodic patterns?
  9. What is the difference between global and local periodic patterns?
  10. What are some challenges in periodic pattern mining?

Answers

  1. The main goal of periodic pattern mining is to discover periodic patterns, that is, events that repeat more or less regularly over time in a sequence of events.
  2. Sequential pattern mining focuses on finding subsequences that are common to multiple sequences, while periodic pattern mining focuses on finding repeating patterns within a single sequence of events.
  3. Some common applications of periodic pattern mining include stock market analysis, weather forecasting, customer behavior analysis, and bioinformatics.
  4. The minimum support threshold is a user-defined parameter that specifies the minimum number of occurrences that a pattern must have in a sequence.
  5. The period length is the length of time between repeating occurrences of a pattern in the input sequence.
  6. The PFPM (Periodic Frequent Pattern Miner) algorithm is an algorithm for discovering itemsets that periodically appear in a sequence of transactions.
  7. A stable periodic pattern is a pattern that occurs at regular intervals with little variability in a sequence of events. This is the opposite of an unstable pattern. Several algorithms exist for this, such as TSPIN and LPP-Growth.
  8. Exact periodic patterns have a fixed period length and occur at regular intervals, while approximate periodic patterns have some variability in their period length and occurrence.
  9. Global periodic patterns occur throughout the entire input sequence, while local periodic patterns are patterns that have a periodic behavior only within some specific time intervals.
  10. Some challenges in periodic pattern mining include handling large datasets, dealing with noise and missing data, and incorporating constraints.
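To make answers 4, 5, and 7 more concrete, here is a small hypothetical Python sketch that computes the periods of a pattern from the positions where it occurs, following the common definition (used, e.g., by algorithms such as PFPM) where the first and last periods measure the gaps at the boundaries of the sequence.

```python
def periods(occurrence_positions, sequence_length):
    # Periods are the gaps between consecutive occurrences, including the
    # gap before the first occurrence and after the last one
    points = [0] + sorted(occurrence_positions) + [sequence_length]
    return [b - a for a, b in zip(points, points[1:])]

def is_periodic(occurrence_positions, sequence_length, max_period):
    # A pattern is periodic if it never disappears for more than max_period
    return max(periods(occurrence_positions, sequence_length)) <= max_period

# A pattern occurring at transactions 2, 4, 6 and 8 in a sequence of length 10
p = periods([2, 4, 6, 8], 10)  # [2, 2, 2, 2, 2]: perfectly regular
```

A stable pattern has low variability among its periods. By contrast, a pattern occurring at positions [1, 2, 9] in the same sequence has periods [1, 1, 7, 1]: its large maximum period means it disappears for a long stretch.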

Have you succeeded in answering all the questions? Let me know how you did in the comment section 😉



An Interactive Demo of The Apriori algorithm

I have created a new website for students that provides an interactive demo of the Apriori algorithm. It allows you to run Apriori in your browser and see the results step by step.

The website is here: Apriori Algorithm Demo

To use it, you first have to input some data, choose a minimum support value, and then click the Run Apriori button:

Then, all the steps of applying the Apriori algorithm are displayed:

Step 1: Calculate the support of single items

  • {apple} (support: 3)
  • {orange} (support: 5)
  • {milk} (support: 1)
  • {tomato} (support: 3)
  • {bread} (support: 4)

Step 2: Keep only the frequent items

{milk} was pruned because its support count (1) is less than the minimum support count (2)

  • {apple} (support: 3)
  • {orange} (support: 5)
  • {tomato} (support: 3)
  • {bread} (support: 4)

Step 3: Join frequent itemsets to create candidates of size 2

- {apple} and {orange} are joined to obtain {apple, orange}

- {apple} and {tomato} are joined to obtain {apple, tomato}

- {apple} and {bread} are joined to obtain {apple, bread}

- {orange} and {tomato} are joined to obtain {orange, tomato}

- {bread} and {orange} are joined to obtain {bread, orange}

- {bread} and {tomato} are joined to obtain {bread, tomato}

Step 4: Calculate the support of candidate itemsets

  • {apple, orange} (support: 3)
  • {apple, tomato} (support: 2)
  • {apple, bread} (support: 2)
  • {orange, tomato} (support: 3)
  • {bread, orange} (support: 4)
  • {bread, tomato} (support: 3)

Step 5: Keep only the candidate itemsets that are frequent

  • {apple, orange} (support: 3)
  • {apple, tomato} (support: 2)
  • {apple, bread} (support: 2)
  • {orange, tomato} (support: 3)
  • {bread, orange} (support: 4)
  • {bread, tomato} (support: 3)

Step 6: Join frequent itemsets to create candidates of size 3

- {apple, orange} and {apple,tomato} are joined to obtain {apple, orange, tomato}

- {apple, bread} and {apple, orange} are joined to obtain {apple, bread, orange}

- {apple, bread} and {apple, tomato} are joined to obtain {apple, bread, tomato}

- {bread, orange} and {bread,tomato} are joined to obtain {bread, orange, tomato}

Step 7: Calculate the support of candidate itemsets

  • {apple, orange, tomato} (support: 2)
  • {apple, bread, orange} (support: 2)
  • {apple, bread, tomato} (support: 2)
  • {bread, orange, tomato} (support: 3)

Step 8: Keep only the candidate itemsets that are frequent

  • {apple, orange, tomato} (support: 2)
  • {apple, bread, orange} (support: 2)
  • {apple, bread, tomato} (support: 2)
  • {bread, orange, tomato} (support: 3)

Step 9: Join frequent itemsets to create candidates of size 4

- {apple, bread, orange} and {apple,bread,tomato} are joined to obtain {apple, bread, orange, tomato}

Step 10: Calculate the support of candidate itemsets

  • {apple, bread, orange, tomato} (support: 2)

Step 11: Keep only the candidate itemsets that are frequent

  • {apple, bread, orange, tomato} (support: 2)

Step 12: Join frequent itemsets to create candidates of size 5

No more candidates can be generated. Total number of frequent itemsets found: 15

As well as the final result (the frequent itemsets):


That is all. With this tool, you can run the algorithm in your browser and see directly the result, which is useful for students.
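The level-wise count-prune-join loop shown in the steps above can be sketched in Python as follows. This is an illustrative sketch rather than the demo's actual code, and the transaction database below is a hypothetical one chosen to be consistent with the supports shown in the steps.

```python
from itertools import combinations

def apriori(transactions, minsup):
    frequent = {}
    # Level 1: candidate single items
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        # Count the support of each candidate by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep only the frequent candidates of this level
        level = {c: s for c, s in counts.items() if s >= minsup}
        frequent.update(level)
        # Join frequent k-itemsets that differ by one item into (k+1)-candidates,
        # pruning any candidate that has an infrequent k-subset
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                    frozenset(sub) in level for sub in combinations(union, len(a))):
                candidates.add(union)
    return frequent

# A hypothetical database consistent with the supports shown in the steps
database = [{"apple", "orange", "bread", "tomato"},
            {"apple", "orange", "bread", "tomato"},
            {"orange", "bread", "tomato"},
            {"apple", "orange", "milk"},
            {"orange", "bread"}]
result = apriori(database, minsup=2)
```

With a minimum support of 2, this finds the same 15 frequent itemsets as above; {milk} is pruned at level 1 because its support is only 1.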

If you want to learn more about the Apriori algorithm, you can check my blog post that explains the Apriori algorithm and my video lecture about Apriori. And if you want to know more about pattern mining, please check my free online pattern mining course. Finally, if you want an efficient implementation of Apriori, please check the SPMF software, which offers highly efficient implementations of Apriori and hundreds of other pattern mining algorithms.

Also, if you are new to itemset mining, you might be interested in checking these two survey papers, which give a good introduction to the topic:

  • Fournier-Viger, P., Lin, J. C.-W., Vo, B., Chi, T. T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207. DOI: 10.1002/widm.1207.
  • Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: A 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329. DOI: 10.1002/widm.1329.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.


What are the applications of sequential pattern mining?

Sequential pattern mining (SPM) is a data mining technique that aims to discover patterns or subsequences that occur frequently in sequential data. Sequential data can be found in various domains such as customer transactions, web browsing, text analysis, and bioinformatics. SPM can help reveal interesting and useful insights from such data, such as customer behavior, website navigation, text structure, and gene expression.

Because I often receive questions about what the applications of sequential pattern mining are, I will discuss this in this blog post. I will present several example applications of sequential pattern mining, including malware detection, COVID-19 analysis, e-learning, and point-of-interest recommendation. I will briefly explain how SPM can be used for these tasks and cite some research papers that have applied SPM for these purposes.

Note: The list of applications on this page is far from exhaustive. There are hundreds of applications, and I do not cite all the papers, as there are too many. I give some of my own papers as examples, as I am familiar with their content.

Sequential Pattern Mining for Malware Detection

Malware is malicious software that can harm computers and networks by stealing data, encrypting files, disrupting services, or spreading viruses. Malware detection is the task of identifying and classifying malware based on its behavior and characteristics. One of the challenges of malware detection is that malware can change its appearance or code to evade detection by using techniques such as obfuscation, packing, encryption, or metamorphism.

One way to overcome this challenge is to analyze the behavior of malware during execution, rather than their static code. A common way to represent the behavior of malware is by using the Application Programming Interface (API) calls that they make to interact with the operating system or other applications. API calls can reflect the actions and intentions of malware, such as creating files, modifying registry keys, sending network packets, or injecting code.

Sequential pattern mining can be used to discover frequent patterns of API calls made by different malware families or variants. These patterns can then be used as features for malware classification or clustering. For example, Nawaz et al. (2022) proposed MalSPM, an intelligent malware detection system based on sequential pattern mining for the analysis and classification of metamorphic malware behavior. They used a dataset that contains sequences of API calls made by different malware on Windows OS and applied SPM algorithms to find frequent API calls and their patterns. They also discovered sequential rules between API call patterns as well as maximal and closed frequent API call patterns. They then used these patterns as features for seven classifiers and compared their performance with existing state-of-the-art malware detection approaches. They showed that MalSPM outperformed these approaches in terms of accuracy, precision, recall, F1-score, and ROC-AUC.

Research papers:
Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier.

Blog post:
How to Detect and Classify Metamorphic Malware with Sequential Pattern Mining (MalSPM)

Research paper on related topic of botnet detection:
Daneshgar FF, Abbaspour M. A two-phase sequential pattern mining framework to detect stealthy P2P botnets. Journal of Information Security and Applications. 2020 Dec 1;55:102645.

Sequential Pattern Mining for Healthcare

COVID-19 is a novel coronavirus disease that emerged in late 2019 and has since spread to more than 200 countries, causing more than 5 million deaths worldwide. COVID-19 genome sequencing is critical to understanding the virus's behavior, its origin, and how fast it mutates, as well as for the development of drugs/vaccines and effective preventive strategies.

SPM can be used to learn interesting information from COVID-19 genome sequences, such as frequent patterns of nucleotide bases and their relationships with each other. These patterns can help examine the evolution and variations in COVID-19 strains and identify potential mutations that may affect the virus infectivity or virulence. For example, Nawaz et al. (2021) applied SPM on a corpus of COVID-19 genome sequences to find interesting hidden patterns. They also applied sequence prediction models to evaluate if nucleotide bases can be predicted from previous ones. Moreover, they designed an algorithm to find the locations in the genome sequences where the nucleotide bases are changed and to calculate the mutation rate. They suggested that SPM and mutation analysis techniques can reveal interesting information and patterns in COVID-19 genome sequences.

Research papers:

Nawaz, S., Fournier-Viger, P., Aslam, M., Li, W., He, Y., Niu, X. (2023). Using Alignment-Free and Pattern Mining Methods for SARS-CoV-2 Genome Analysis. Applied Intelligence.

Nawaz, S. M., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier.

Blog post: Analyzing the COVID-19 genome with AI and data mining techniques (paper + data + code)

Another application of SPM for COVID-19 analysis is to discover sequential patterns from clinical records of patients who have been diagnosed with COVID-19 or are at risk of developing it. These patterns can help identify risk factors, symptoms, comorbidities, treatments, outcomes, or complications associated with COVID-19. Similar techniques have already been used for other diseases: Cheng et al. (2017) proposed a framework for the early assessment of chronic diseases by mining sequential risk patterns with time-interval information from diagnostic clinical records, combining SPM with classification models.
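Before any patterns can be mined from clinical data, the raw visit records must be encoded as a sequence database: one diagnosis sequence per patient, ordered by time. A minimal sketch of this preprocessing step (the patient IDs, dates, and diagnosis codes below are hypothetical, not taken from the papers):

```python
from collections import defaultdict

# Hypothetical visit records: (patient_id, visit_date, diagnosis_code).
records = [
    ("p1", "2016-01-10", "bronchitis"),
    ("p1", "2016-03-02", "pneumonia"),
    ("p2", "2016-02-15", "bronchitis"),
    ("p2", "2016-05-20", "copd"),
    ("p1", "2016-06-01", "copd"),
]

def to_sequence_db(records):
    """Group records by patient and sort by date, producing one
    diagnosis sequence per patient (the input format for SPM)."""
    by_patient = defaultdict(list)
    for pid, date, code in records:
        by_patient[pid].append((date, code))
    return {pid: [code for _, code in sorted(visits)]
            for pid, visits in by_patient.items()}

print(to_sequence_db(records))
```

An SPM algorithm can then search the resulting sequences for risk patterns such as "bronchitis followed later by COPD".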

Research paper:

Cheng YT, Lin YF, Chiang KH, Tseng VS. (2017) Mining Sequential Risk Patterns From Large-Scale Clinical Databases for Early Assessment of Chronic Diseases: A Case Study on Chronic Obstructive Pulmonary Disease. IEEE J Biomed Health Inform. 2017 Mar;21(2):303-311. 

Related papers:
Ou-Yang, C., Chou, S. C., Juan, Y. C., & Wang, H. C. (2019). Mining Sequential Patterns of Diseases Contracted and Medications Prescribed before the Development of Stevens-Johnson Syndrome in Taiwan. Applied Sciences, 9(12), 243

Lee, E. W., & Ho, J. C. (2019). FuzzyGap: Sequential Pattern Mining for Predicting Chronic Heart Failure in Clinical Pathways. AMIA Summits on Translational Science Proceedings, 2019, 222.

Sequential pattern mining for E-learning

E-learning is the use of electronic media and information technologies to deliver education and training. E-learning systems have undergone rapid growth in the current decade and have generated a tremendous amount of e-learning resources that are highly heterogeneous and in various media formats. This information overload has led to the need for personalization in e-learning environments, to provide more relevant and suitable learning resources to learners according to their individual needs, objectives, preferences, characteristics, and learning styles.

SPM can be used to discover sequential patterns of learners’ behavior and interactions with e-learning systems, such as the sequences of learning resources accessed, the sequences of learning activities performed, or the sequences of learning outcomes achieved. These patterns can help understand the learning processes and preferences of learners, identify their strengths and weaknesses, provide feedback and guidance, recommend appropriate learning resources or activities, predict their performance or satisfaction, and improve the design and evaluation of e-learning systems.
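As a toy illustration of this idea, the sketch below finds ordered pairs of learner actions (an action a appearing before an action b, not necessarily consecutively) that are frequent across click-streams. The action names are hypothetical; real SPM algorithms such as those in SPMF mine patterns of arbitrary length far more efficiently:

```python
from itertools import combinations

def frequent_pairs(sequences, minsup):
    """Find ordered pairs (a, b) such that a occurs before b (not
    necessarily consecutively) in at least `minsup` sequences."""
    support = {}
    for seq in sequences:
        # Collect each ordered pair at most once per sequence.
        seen = set()
        for i, j in combinations(range(len(seq)), 2):
            seen.add((seq[i], seq[j]))
        for pair in seen:
            support[pair] = support.get(pair, 0) + 1
    return {p: c for p, c in support.items() if c >= minsup}

# Hypothetical click-streams of learners in an e-learning system.
logs = [
    ["video", "quiz", "forum"],
    ["video", "forum", "quiz"],
    ["reading", "video", "quiz"],
]
print(frequent_pairs(logs, minsup=3))  # prints {('video', 'quiz'): 3}
```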

There are many papers on this topic. Some of the earliest were written by me during my Ph.D. studies (in 2008).

Research papers:

Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence, to appear.
DOI : 10.1007/s10489-021-03122-7

Fournier-Viger, P., Nkambou, R., Mayers, A. (2008). Evaluating Spatial Representations and Skills in a Simulator-Based Tutoring System. IEEE Transactions on Learning Technologies (TLT), 1(1):63-74.
DOI: 10.1109/TLT.2008.13

Fournier-Viger, P., Nkambou, R. & Mephu Nguifo, E. (2008). A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican Intern. Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Yau, T. S. Mining Sequential Patterns of Students’ Access on Learning Management System. Data Mining and Big Data, 191.

王文芳 (2016) Learning Process Analysis Based on Sequential Pattern Mining in a Web-based Inquiry Science Environment.

Rashid, A., Asif, S., Butt, N. A., Ashraf, I. (2013). Feature Level Opinion Mining of Educational Student Feedback Data using Sequential Pattern Mining and Association Rule Mining. International Journal of Computer Applications 81.

Point-of-Interest Recommendation

Point-of-interest (POI) recommendation is the task of recommending places or locations that users may be interested in visiting based on their preferences, context, and history. POI recommendation has many applications such as tourism, navigation, social networking, advertising, and urban planning. POI recommendation can help users discover new places that match their interests and needs, save time and effort in searching for information, enhance their travel experience and satisfaction, and increase their loyalty and engagement with service providers.

SPM can be used to discover sequential patterns of users’ mobility behavior and preferences, such as the sequences of POIs visited by users within a day or a trip. These patterns can help understand the travel patterns and preferences of users, identify their travel purposes and intentions, provide feedback and guidance, recommend appropriate POIs or routes based on their travel history or context, predict their next destination or activity, and improve the design and evaluation of POI recommendation systems.
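A very simple instance of this idea is to score candidate next POIs by how often they directly follow the user's current POI in past visit sequences. The sketch below uses hypothetical check-in data; real successive-POI recommenders also exploit longer patterns, time, and geography:

```python
from collections import Counter

def next_poi_scores(trips, current):
    """Score candidate next POIs by how often each one directly
    follows `current` in past visit sequences."""
    counts = Counter()
    for trip in trips:
        for a, b in zip(trip, trip[1:]):
            if a == current:
                counts[b] += 1
    return counts.most_common()

# Hypothetical check-in sequences from a location-based social network.
trips = [
    ["museum", "cafe", "park"],
    ["museum", "cafe", "mall"],
    ["hotel", "museum", "cafe"],
]
print(next_poi_scores(trips, "museum"))  # prints [('cafe', 3)]
```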

Research papers:

Amirat, H., Lagraa, N., Fournier-Viger, P., Ouinten, Y., Kherfi, M. L., Guellouma, Y. (2022). Incremental Tree-based Successive POI Recommendation in Location-based Social Networks. Applied Intelligence, to appear.

Lee, G. H., & Han, H. S. (2019). Clustering of tourist routes for individual tourists using sequential pattern mining. The Journal of Supercomputing, 1-18.

Yang, C., & Gidófalvi, G. (2017). Mining and visual exploration of closed contiguous sequential patterns in trajectories. International Journal of Geographical Information Science, 1-23.

Some other applications

Theorem proving

Nawaz, M. S., Sun, M., & Fournier-Viger, P. (2019, May). Proof guidance in PVS with sequential pattern mining. In International Conference on Fundamentals of Software Engineering (pp. 45-60). Springer, Cham.

Fire spot identification

Istiqomah, N. (2018). Fire Spot Identification Based on Hotspot Sequential Pattern and Burned Area Classification. BIOTROPIA-The Southeast Asian Journal of Tropical Biology, 25(3), 147-155.

Sentiment analysis
Rana, T. A., & Cheah, Y. N. (2018). Sequential patterns-based rules for aspect-based sentiment analysis. Advanced Science Letters, 24(2), 1370-1374.

Pollution analysis
Yang, G., Huang, J., & Li, X. (2017). Mining Sequential Patterns of PM 2.5 Pollution in Three Zones in China. Journal of Cleaner Production.

Anomaly detection
Rahman, A. et al. (2016). Finding Anomalies in SCADA Logs Using Rare Sequential Pattern Mining. Proc. International Conference on Network and System Security, Springer.

Authorship attribution
Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

Video games
Leece, M., Jhala, A. (2014). Sequential Pattern Mining in StarCraft: Brood War for Short and Long-Term Goals. Proc. Workshop on Adversarial Real-Time Strategy Games at AIIDE Conference. AAAI Press.

Restaurant recommendation
Han, M., Wang, Z., Yuan, J. (2013). Mining Constraint Based Sequential Patterns and Rules on Restaurant Recommendation System. Journal of Computational Information Systems, 9(10): 3901-3908.

Conclusion

In this blog post, I have introduced some representative applications of sequential pattern mining. Hope that this has been interesting. If you want to know more about sequential pattern mining, you can check my survey of sequential pattern mining and my video introduction to sequential pattern mining (24 minutes). Also, you can check out my free online pattern mining course and the SPMF open-source software, which offers hundreds of pattern mining algorithms.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.


Test your knowledge about sequential pattern mining!

Sequential pattern mining is a data mining technique used to discover frequent sequences or patterns in data. I have prepared a short test of 10 questions to evaluate your knowledge of sequential pattern mining. The test is not very hard. It is just for fun.

So are you ready to test your knowledge about sequential pattern mining?

Here are 10 questions:

Questions

  1. What is the main goal of sequential pattern mining?
  2. What is the difference between sequential pattern mining and association rule mining?
  3. What are some common applications of sequential pattern mining?
  4. What is the minimum support threshold in sequential pattern mining?
  5. What is the difference between a sequence database and a transaction database?
  6. What is the difference between a sequence and a subsequence?
  7. What is the difference between a maximal and a closed sequential pattern?
  8. What is the difference between sequence and item constraints in sequential pattern mining?
  9. What is the difference between vertical and horizontal data formats in sequential pattern mining?
  10. What are some challenges in sequential pattern mining?

Answers

  1. The main goal of sequential pattern mining is to discover frequent sequences or patterns in a sequence database (a set of sequences).
  2. Association rule mining focuses on finding strong associations between items within transactions. It does not consider the time dimension or the sequential ordering between items. On the other hand, sequential pattern mining focuses on finding relationships between items across transactions over time. More precisely, sequential pattern mining aims at finding subsequences that appear frequently in a set of sequences.
  3. Some common applications of sequential pattern mining include market basket analysis, web usage mining, authorship attribution, malware detection, and bioinformatics.
  4. The minimum support threshold is a user-defined parameter that specifies the minimum number of sequences of the database in which a pattern must appear to be considered frequent.
  5. A sequence database contains sequences of ordered events or items, while a transaction database contains unordered sets of items.
  6. A sequence is an ordered list of events or items, while a subsequence is a subset of events or items from a sequence that maintains their relative order.
  7. A maximal sequential pattern is a frequent sequence that is not a subsequence of any other frequent sequence, while a closed sequential pattern is a frequent sequence that has no super-sequence with the same support.
  8. Sequence constraints specify conditions on the order and timing of events or items within a sequence, while item constraints specify conditions on the presence or absence of specific items within a sequence.
  9. In the horizontal data format, each row represents a sequence as an ordered list of its items, while in the vertical data format, each row represents an item together with the list of sequences (and positions) where it appears.
  10. Some challenges in sequential pattern mining include handling large datasets, dealing with noise and missing data, improving the runtime and memory consumption, and incorporating constraints. There are of course, many other challenges that could be mentioned.
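Answers 1, 4 and 6 can be made concrete with a few lines of code: checking whether a pattern is a subsequence of a sequence, and counting the pattern's support in a sequence database. This is an illustrative sketch, not tied to any particular library:

```python
def is_subsequence(pattern, sequence):
    """Check whether `pattern` occurs in `sequence` in the same
    relative order (items need not be consecutive)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Support of a pattern = number of sequences containing it."""
    return sum(is_subsequence(pattern, s) for s in database)

db = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
print(support(["a", "c"], db))  # prints 3: the pattern appears in every sequence
```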

Conclusion

How did you do on the quiz? 😊 Let me know in the comment section below!


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 180 data mining algorithms.


Test your knowledge about pattern mining!

Today, I publish a small quiz to test your knowledge about pattern mining. There are 10 questions, which are then followed by the answers. Those questions are not very hard 😉 How many can you get right?

Questions

  1. What is the goal of pattern mining?
  2. What is the difference between frequent pattern mining and sequential pattern mining?
  3. What is the Apriori algorithm used for?
  4. What is the difference between association rule mining and correlation analysis?
  5. What is the minimum support threshold in frequent pattern mining?
  6. What is the minimum confidence threshold in association rule mining?
  7. What is a closed pattern?
  8. What is the difference between itemset mining and sequence mining?
  9. What is the difference between vertical data format and horizontal data format in itemset mining?
  10. What is the difference between objective measures and subjective measures in pattern evaluation?

Answers

  1. The goal of pattern mining is to discover interesting, previously unknown patterns from large datasets.
  2. Frequent pattern mining focuses on finding frequently occurring patterns within a dataset, while sequential pattern mining focuses on finding frequently occurring patterns where the order of items matters.
  3. The Apriori algorithm is used for frequent itemset mining.
  4. Association rule mining focuses on finding relationships between items within a dataset, while correlation analysis focuses on finding relationships between variables within a dataset.
  5. The minimum support threshold in frequent pattern mining is the minimum number of times a pattern must occur within a dataset to be considered frequent.
  6. The minimum confidence threshold in association rule mining is the minimum conditional probability that the consequent of a rule appears in a transaction, given that its antecedent appears in that transaction.
  7. Closed patterns are patterns that cannot be extended by adding more items without decreasing their support. Some related concepts are generator patterns and maximal patterns.
  8. Itemset mining focuses on finding sets of items that frequently occur together within a dataset, while sequence mining focuses on finding sequences of items that frequently occur together within a dataset.
  9. In itemset mining, the horizontal data format stores the data as a list of transactions, where each transaction (record) contains a list of items, while the vertical data format stores the data as a list of items, where each item is associated with the list of transactions in which it appears.
  10. Objective measures evaluate patterns based on statistical properties of the data, while subjective measures evaluate patterns based on their usefulness or interestingness to the user.
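To illustrate answers 3 and 5, here is a minimal, unoptimized sketch of the Apriori idea: count candidate itemsets level by level, and only generate a (k+1)-candidate when all of its k-subsets are frequent. It is illustrative code, not the implementation of any particular library:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori: find all itemsets contained in at least
    `minsup` transactions, exploring itemsets level by level."""
    transactions = [frozenset(t) for t in transactions]
    candidates = list({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        # Generate (k+1)-candidates whose k-subsets are all frequent.
        seen = set()
        candidates = []
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and u not in seen:
                seen.add(u)
                if all(frozenset(s) in level for s in combinations(u, k)):
                    candidates.append(u)
        k += 1
    return frequent

txns = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
freq = apriori(txns, minsup=2)
print(sorted((sorted(s), n) for s, n in freq.items()))
```

With minsup = 2, every single item and every pair is frequent here, but {a, b, c} only appears in one transaction and is pruned.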

Tell me how many questions you answered right in the comment section below!


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.


Visualizing the item frequency distribution of pattern mining datasets

In this blog post, I will explain a quick and easy way of visualizing the frequency distribution of items in a dataset in SPMF format for pattern mining.

To do this, we will use a new online tool that I have put on my website for visualizing the frequency distribution of words in a text file. This tool can be used with any text file, but here I will demonstrate how it can be used for analyzing a pattern mining dataset.

Step 1: First, we will download the retail.txt dataset, which is a popular dataset used in frequent itemset mining. This dataset is available on the SPMF dataset webpage. But here is the direct link for downloading this dataset:
https://www.philippe-fournier-viger.com/spmf/datasets/retail.txt

Step 2: Then, we will open the online tool for analyzing the frequency distribution of items in a text file: https://www.philippe-fournier-viger.com/tools/draw_frequency_distribution.php

Step 3: Then, we will click on the following button and select the retail.txt file as input:

Step 4: Then, we could set various options but we will just keep the default options for now:

Step 5: Then, we will click the Draw button:

Step 6: This will display a chart showing the frequency distribution of items in this dataset. The X axis indicates the different frequency (support) values, while the Y axis indicates the number of distinct items that have a given frequency value. The result is like this:

As we can see from this picture, many items have a low frequency (support) and very few items have a high frequency. We can also see some statistics:

There are a total of 908576 item occurrences in the dataset (when counting duplicates), and only 16470 distinct items. We can also see the top 20 most frequent items:

Step 7: Now, if we want to better view the results displayed in the chart, we can change the maximum value for the X axis to 50, and then click Draw again:

The results now look like this:

Step 8: After we have done this, we can also use the following buttons to export this chart as a PNG image file, or to export the data used for this chart as a CSV file, or as a Latex PGFPlots file.

For example, if I export the data from the chart to Latex, the result is this:

Here, it does not look so good, as there are too many labels on the X axis. It would require further tuning of the Latex code to better display the labels on the X axis, which I will not do now.
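As a side note, the same kind of frequency distribution can also be computed with a few lines of code instead of the online tool. The sketch below assumes the simple SPMF itemset format used by retail.txt (one transaction per line, items separated by spaces):

```python
from collections import Counter

def item_frequency_distribution(lines):
    """Given the lines of a dataset in SPMF itemset format (one
    transaction per line, items separated by spaces), return the
    support of each item and the distribution mapping each support
    value to the number of items having that support."""
    support = Counter()
    for line in lines:
        # A set so that duplicate items in a line are counted once.
        support.update(set(line.split()))
    distribution = Counter(support.values())
    return support, distribution

# A tiny dataset in the same format as retail.txt.
data = ["1 2 3", "1 2", "2 4", "1"]
support, dist = item_frequency_distribution(data)
print(support)   # item -> number of transactions containing it
print(dist)      # support value -> number of items with that support
```

For the real retail.txt file, the lines would be read with `open("retail.txt")` instead of the hardcoded list.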

Conclusion

This was just a short blog post to present an interesting new tool for analyzing the frequency distribution of items in a dataset in SPMF format and, more generally, of words in any text file. It can be used for various types of datasets.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.
