As speakers, we should always be ready for the unexpected…

I often say that, as an invited speaker at a conference or as a teacher of a course, we need to be ready for the unexpected and prepared for every situation that could happen. This means, for example, bringing special cables or adapters that may be needed to give a talk in a new location, having at least two copies of our presentation on different media (e.g. laptop, USB drive, or email), and arriving early to avoid being late.

Today was such a day, where the unexpected happened. I was a keynote speaker yesterday at an AI Innovation Think Tank forum in Shanghai, and was supposed to fly immediately after to another city (Changchun) in the evening to give another talk the next day. Long story short, the flight was delayed from 8 PM to 10 PM, then to 4 AM, before being cancelled. Thus, I only slept a few hours and had to deal with many problems to obtain a refund from the airline. Given that I would clearly be unable to attend the conference on time, I contacted the organizers early so that we could arrange for my talk to be given online. I also recorded a video of my talk in the morning and sent it to the organizers, so that they could play it if the network connection turned out to be bad for whatever reason. This was not requested, but it can truly make a difference: I have often seen online conference talks where we could barely hear the speaker due to a poor internet connection, and I did not want this to happen!

Then, as I still had to fly home, I had to give my keynote talk from the airport, finding a quiet place to do it before boarding my flight. Thus, I went to the airport early to look for a suitable place; the internet connection was very good, and I set myself up on a cart in a quiet spot.

Also, it helps that I carry with me a portable RODE shotgun microphone that gives professional sound to my talks while on the go. This type of microphone is very good for an environment like an airport, as it focuses on the sound directly in front of it and mostly ignores surrounding noise.

I also carry with me an excellent pair of headphones.

Sometimes, I also carry a tripod, a portable light, and a noise filter for my microphone as well (but not this time). Here are some pictures of different accessories that I sometimes use with a portable tripod in different situations:

I also like to carry with me a laptop stand for working on the go:

And something very useful is to have a mouse. But not just any mouse: I personally highly recommend the Logitech MX Anywhere, a portable computer mouse that works on basically any surface, even glass or clothing.

This is perfect when travelling. You can be sure that the mouse can be used anywhere.

So this was just a short blog post to say that it is always better to be ready for the unexpected 🙂 If you have similar stories of unexpected things that happened to you, please share them in the comments below.

By the way, I did not write on the blog for a little while, as I had a lot going on recently. Things are better now, and I will post more in the coming weeks.
—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Two common English errors in pattern mining papers

This is a short blog post to talk about two common errors in pattern mining research papers.

1) The first error is:

“mining frequent itemsets from a database”
“mining patterns from a stream”
“mining patterns over a database”
“mining patterns over a data stream”

In English, we don’t mine something “from” or “over” something else; we mine something “in” something else. So it should be “mining frequent itemsets in a database” and “mining patterns in a stream”.

2) The second error is:

“association rules mining”, “frequent itemsets mining”

The correct way to write them is:

“association rule mining”, “frequent itemset mining”

Conclusion

These two errors are very common. That is why I think it is important to mention them.


How to become a well-known researcher?

The other day, a young researcher asked me: what should I do to become a more well-known researcher in my field? In this blog post, I will try to answer that question.

But before that, it should be said that becoming a well-known researcher is not easy and requires hard work, dedication and a lot of motivation. However, it can have many benefits, such as attracting more funding, collaborators, citations and recognition.

Below is my advice.

1. Publish great work in excellent conferences/journals

To become better known, one should do work that is impactful, relevant and original. Moreover, publishing papers in respected journals and conferences will increase the likelihood that other people read them. Thus, it is better to write fewer but better papers and publish them in good venues than to write many low-quality papers and publish them in unknown conferences. Papers should also be accessible, that is, written in a way that is not too hard to read.

2. Make the code/data of your research public

Putting your code and data on a public website that everyone can access will also increase the probability that other people will use your work and thus cite your paper as well. This is a good strategy to increase your impact.

3. Build a network

Don’t just work by yourself. Visit other research teams, make friends with other researchers, and try to collaborate with top researchers in your field. This is important for building connections, and other people will start to know you. Also, if you are a student, try to work in the team of a well-known professor. Attending academic conferences is also a good way to meet other researchers and create collaborations or opportunities for the future.

In my case, for example, I have travelled to many countries (e.g. Vietnam, Japan, New Caledonia, South Korea, Spain and Thailand, to name a few) to build collaborations with different teams, and I have attended many conferences. This has been very helpful for creating collaborations.

4. Write survey papers

Writing survey papers is also a good way to increase your impact in a field. Survey papers can attract more citations than other papers, and by writing one, you can describe a research area from your perspective and also mention your own work.

5. Make yourself a website and keep it up-to-date

I think every researcher should have at least a basic website where their papers, code and data are freely available. This helps people find you and your research papers. It is highly important, yet I notice that many researchers don’t have a website.

6. Work on improving yourself

Top researchers are generally humble, curious, creative and hardworking. Learn to accept that you will sometimes face failure, and keep working to get better. Listen to feedback from other researchers. Follow the trends in your field and be open to learning new topics. Try to improve your weaknesses (e.g. writing ability, oral presentation skills).

7. Organize workshops, special issues, books

If you are no longer a student but a full-time researcher or faculty member, you may think about starting to organize workshops, special issues of journals, edited books, or even conferences. All of these will help you become better known in your field and build relationships with other researchers.

Conclusion

This was just a short blog post giving a few tips on how to become a better-known researcher. Some of these tips are simple, yet many researchers do not follow them (e.g. creating a website and keeping it up to date, or publishing their code and data).



Call for tutorials at BESC 2023

This year, I am the tutorial chair of BESC 2023, the 10th International Conference on Behavioural and Social Computing (besc-conf.org), which will be held in Cyprus from 30 October to 1 November.

I am looking for people who could give a tutorial at this conference. If you would like to give a tutorial, or know someone who may be interested, please send me a proposal at philfv@qq.com. A tutorial is about two hours long, and if you cannot attend in person, it can be given online. All proposals will be considered, but they will be evaluated based on their relevance to the conference and their potential interest to the audience.


Some shortcomings of CSRankings

CSRankings is a popular website that provides a ranking of computer science departments around the world. It can be found at https://csrankings.org/. In this blog post, I will talk about this ranking and some of its shortcomings. Of course, no ranking is perfect, and what I write below is just my personal opinion.

What is CSRankings?

First, it should be said that there exist many rankings of computer science departments, using various criteria based on teaching, research, or a combination of both. CSRankings is purely focused on research: it evaluates a department based on its research output in terms of articles in the very top conferences. A good thing about this ranking is that its algorithm is completely transparent and well explained: (1) it uses public data, and (2) the code that assigns a score to each department is documented and open source.

The ranking looks like this:

Shortcomings of CSRankings

Now, let’s talk about what I see as the main shortcomings of CSRankings:

1) The ranking ignores journal papers and focuses only on conference papers, but in several countries journals are deemed more important than conference publications. Thus, there is a bias there.

2) It is a US-centric ranking. As explained in the FAQ of CSRankings, a conference is only included in this ranking if at least 50 R1 US universities have published in it during the last 10 years.

3) Some sub-fields of computer science are not well represented, and some conferences appear easier to publish in than others. For example, I am a data mining researcher, and KDD is arguably the top conference in my field. KDD is highly competitive, with thousands of submissions and generally an acceptance rate of around 10-20%, yet it is deactivated by default in CSRankings:

I also notice that most other top data mining conferences, such as ICDM and CIKM, are not included either. ICDE is another data mining related conference, with an acceptance rate of about 19%. It is included, but also deactivated by default:

I find this quite surprising because, for other fields, some conferences that are arguably easier to publish in than ICDE and KDD are included in the ranking. For example, ICDE and KDD typically have acceptance rates in the range of 10-20%, while for robotics the IROS and ICRA conferences are included in CSRankings despite much higher acceptance rates: around 49% for IROS and around 43% for ICRA, as can be seen below (source: https://staff.aist.go.jp/k.koide/acceptance-rate.html):

Thus, it seems to me that the ranking treats fields unequally: some fields have conferences in the ranking that are much easier to publish in than those of other fields. I think this problem emerges from the design decision of CSRankings to include only about three conferences per research area, and to define the research areas based on ACM Special Interest Groups.

4) It is a conservative ranking that focuses on big conferences in popular research areas. It does not encourage researchers to publish in new conferences, but rather to focus on well-established big conferences, as everything else does not count. It also does not encourage publishing in smaller conferences that might be more relevant. For example, during my PhD I was working on intelligent tutoring systems, and the two top conferences in that field are Intelligent Tutoring Systems (ITS) and Artificial Intelligence in Education (AIED). These conferences are rather small and specific, so they are completely ignored by CSRankings. But those are the conferences that matter in that field.

5) By design, the ranking focuses only on research, but other important aspects like teaching may be relevant to some people. For example, an undergraduate student may be interested in how likely graduates are to find a job. In that case, other rankings should be used.

Conclusion

That was just a quick blog post to point out what I see as some shortcomings of CSRankings. Of course, no ranking is perfect, and CSRankings still provides some useful information. But in my opinion it has real limitations: it seems to me that not all fields are treated equally in this ranking.

What do you think? Post your opinion in the comment section below.



An Online Demo of the Eclat Algorithm

I have created a new interactive webpage that demonstrates how the Eclat algorithm performs frequent itemset mining. This webpage allows you to enter a transaction database, select the minimum support, and see step by step what the Eclat algorithm does to produce the final result. This tool is designed for students and anyone who wants to learn how Eclat works.

The webpage is here: Eclat Algorithm Demo

Let me show you how it works. First you have to enter a transaction database such as this:

Then, you can select a minimum support threshold value such as 2 transactions and click the Run Eclat button:

Then, all the steps of the algorithm will be displayed as well as the final result like this:

## Step 1: Convert the Transaction Data to the Vertical Format

| Item | Transaction ids |
|------|-----------------|
| bread | {0, 1, 4} |
| milk | {0, 3, 4} |
| cheese | {1, 2, 3, 4} |
| butter | {1, 3} |
| eggs | {2, 3} |


## Step 2: Calculate the support of each item 

| Item | Transaction ids | Support |
|------|-----------------|---------|
| bread | {0, 1, 4} | 3 |
| milk | {0, 3, 4} | 3 |
| cheese | {1, 2, 3, 4} | 4 |
| butter | {1, 3} | 2 |
| eggs | {2, 3} | 2 |


## Step 3: Keep only the frequent items 

With the minimum support set to 2 transactions, all five items are frequent, so all are kept:

| Item | Transaction ids | Support |
|------|-----------------|---------|
| bread | {0, 1, 4} | 3 |
| milk | {0, 3, 4} | 3 |
| cheese | {1, 2, 3, 4} | 4 |
| butter | {1, 3} | 2 |
| eggs | {2, 3} | 2 |


## Step 4: Sort The Frequent Items by Ascending Order of Support

| Item | Support |
|------|---------|
| butter | 2 |
| eggs | 2 |
| bread | 3 |
| milk | 3 |
| cheese | 4 |

## Step 5: Start the Depth-First Search to Generate Candidates and Find the Frequent Itemsets

##### The algorithm now checks the equivalence class containing these itemsets: 
  equivalence class:  {butter}  {eggs}  {bread}  {milk}  {cheese} 

Joining candidates {butter} and {eggs} to create {butter eggs}:

Transactions of {butter}: {1, 3}
Transactions of {eggs}: {2, 3}
Transactions of {butter eggs}: {1, 3} ∩ {2, 3} = {3}
Support of {butter eggs}: 1
Frequent: No

Joining candidates {butter} and {bread} to create {butter bread}:

Transactions of {butter}: {1, 3}
Transactions of {bread}: {0, 1, 4}
Transactions of {butter bread}: {1, 3} ∩ {0, 1, 4} = {1}
Support of {butter bread}: 1
Frequent: No

Joining candidates {butter} and {milk} to create {butter milk}:

Transactions of {butter}: {1, 3}
Transactions of {milk}: {0, 3, 4}
Transactions of {butter milk}: {1, 3} ∩ {0, 3, 4} = {3}
Support of {butter milk}: 1
Frequent: No

Joining candidates {butter} and {cheese} to create {butter cheese}:

Transactions of {butter}: {1, 3}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {butter cheese}: {1, 3} ∩ {1, 2, 3, 4} = {1, 3}
Support of {butter cheese}: 2
Frequent: Yes

Joining candidates {eggs} and {bread} to create {eggs bread}:

Transactions of {eggs}: {2, 3}
Transactions of {bread}: {0, 1, 4}
Transactions of {eggs bread}: {2, 3} ∩ {0, 1, 4} = {}
Support of {eggs bread}: 0
Frequent: No

Joining candidates {eggs} and {milk} to create {eggs milk}:

Transactions of {eggs}: {2, 3}
Transactions of {milk}: {0, 3, 4}
Transactions of {eggs milk}: {2, 3} ∩ {0, 3, 4} = {3}
Support of {eggs milk}: 1
Frequent: No

Joining candidates {eggs} and {cheese} to create {eggs cheese}:

Transactions of {eggs}: {2, 3}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {eggs cheese}: {2, 3} ∩ {1, 2, 3, 4} = {2, 3}
Support of {eggs cheese}: 2
Frequent: Yes

Joining candidates {bread} and {milk} to create {bread milk}:

Transactions of {bread}: {0, 1, 4}
Transactions of {milk}: {0, 3, 4}
Transactions of {bread milk}: {0, 1, 4} ∩ {0, 3, 4} = {0, 4}
Support of {bread milk}: 2
Frequent: Yes

Joining candidates {bread} and {cheese} to create {bread cheese}:

Transactions of {bread}: {0, 1, 4}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {bread cheese}: {0, 1, 4} ∩ {1, 2, 3, 4} = {1, 4}
Support of {bread cheese}: 2
Frequent: Yes

##### The algorithm now checks the equivalence class containing these itemsets: 
  equivalence class:  {bread,milk}  {bread,cheese} 

Joining candidates {bread milk} and {bread cheese} to create {bread milk cheese}:

Transactions of {bread milk}: {0, 4}
Transactions of {bread cheese}: {1, 4}
Transactions of {bread milk cheese}: {0, 4} ∩ {1, 4} = {4}
Support of {bread milk cheese}: 1
Frequent: No

Joining candidates {milk} and {cheese} to create {milk cheese}:

Transactions of {milk}: {0, 3, 4}
Transactions of {cheese}: {1, 2, 3, 4}
Transactions of {milk cheese}: {0, 3, 4} ∩ {1, 2, 3, 4} = {3, 4}
Support of {milk cheese}: 2
Frequent: Yes

The final result is:

butter
eggs
bread
milk
cheese
butter cheese
eggs cheese
bread milk
bread cheese
milk cheese
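
To summarize the whole process, here is a small Python sketch of the tidset-intersection idea shown above. This is my own illustrative code, not the demo's JavaScript implementation or SPMF's; the database below is the one from the demo (transaction ids 0 to 4):

```python
def eclat(transactions, minsup):
    """Frequent itemset mining with tidset intersections (the Eclat idea)."""
    # Step 1: convert to the vertical format: item -> set of transaction ids
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)
    # Steps 2-4: keep the frequent items, sorted by ascending support
    # (item name breaks ties, to make the order deterministic)
    items = sorted((i for i in tidsets if len(tidsets[i]) >= minsup),
                   key=lambda i: (len(tidsets[i]), i))
    frequent = {}

    # Step 5: depth-first search, intersecting tidsets to compute supports
    def search(prefix, prefix_tids, candidates):
        for pos, item in enumerate(candidates):
            tids = prefix_tids & tidsets[item] if prefix else tidsets[item]
            if len(tids) >= minsup:
                itemset = prefix + (item,)
                frequent[itemset] = len(tids)
                search(itemset, tids, candidates[pos + 1:])

    search((), None, items)
    return frequent

# The transaction database used in the demo above (tids 0 to 4)
database = [
    {"bread", "milk"},
    {"bread", "cheese", "butter"},
    {"cheese", "eggs"},
    {"milk", "cheese", "butter", "eggs"},
    {"bread", "milk", "cheese"},
]
result = eclat(database, 2)  # finds the same 10 frequent itemsets as above
```

With minsup = 2, this returns the ten frequent itemsets listed above, each with its support (e.g. {bread, milk} with support 2).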

This is perfect for learning, as you can easily experiment with different inputs and see the results in your browser. However, this JavaScript implementation of Eclat is not designed for efficiency. For efficient implementations of frequent itemset mining algorithms, please see the SPMF software, which offers highly efficient implementations that can run on large databases.

By the way, I also created another webpage that gives an interactive demo of the Apriori algorithm, another popular algorithm for frequent itemset mining. And if you want to learn more about pattern mining, you may also be interested in my free online course on pattern mining.



Test your knowledge of sequential rule mining!

Do you know about sequential rule mining? It is a popular task in pattern mining that aims at finding rules in sequences. In this blog post, I will give a list of 8 questions and answers to evaluate your knowledge about sequential rule mining.

If you don’t know about sequential rule mining, you may want to read my blog post “An Introduction to Sequential Rule Mining”, which provides a brief introduction to this topic.

The questions are presented next. The answers are at the end of the blog post.

Questions

Question 1: What is a sequence database?

A) A database that contains sequences of items
B) A database that contains sequences of events
C) A database that contains sequences of itemsets
D) A database that contains sequences of strings

Question 2: What is a sequential pattern?

A) A subsequence that appears in several sequences of a database
B) A rule that predicts the next itemset in a sequence
C) A sequence that has a high support and confidence
D) A sequence that has a high frequency and probability

Question 3: What is a sequential rule?

A) A rule that predicts items that will appear after some other items in a sequence
B) A rule that predicts the next sequence in a database
C) A rule that predicts the next item in an itemset
D) A rule that predicts the next event in an event sequence

Question 4: What are some applications of sequential rule mining?

A) Analyzing customer behavior in supermarkets or online shops
B) Recommending products or services to customers based on their previous purchases
C) Optimizing marketing strategies or promotions based on customer preferences
D) All of the above

Question 5: What are some algorithms for sequential rule mining?

A) GSP, SPADE, SPAM, PrefixSpan
B) RuleGrowth, ERMiner, CMRules
C) Apriori, FP-Growth, Eclat
D) A and B

Question 6: What is the difference between a left expansion and a right expansion of a rule? (choose multiple answers as needed)

A) A left expansion adds an item to the left-hand side of a rule
B) A right expansion adds an item to the right-hand side of a rule
C) A left expansion consists of scanning items before a rule
D) A right expansion consists of scanning items after a rule

Question 7: What are some advantages of sequential rule mining compared to sequential pattern mining?

A) Sequential rule mining can capture more information about the probability of a pattern being followed, which is useful for decision-making and prediction
B) Sequential rule mining can generate more rules than patterns, which may increase the complexity and redundancy
C) Sequential rule mining is easier to understand than sequential pattern mining and also always faster
D) All of the above

Question 8: What are some factors that affect the quality and quantity of sequential rules?

A) The density of the sequence database
B) The support and confidence thresholds set by the user
C) The number of different items and the average length of sequences
D) All of the above

Answers

Answer to question 1

C: A sequence database is a database that contains sequences of itemsets. An itemset is a set of items that occur together in a sequence. For example, consider the following sequence database:

| Sequence ID | Sequence |
|-------------|----------|
| seq1 | <{a}{c}{e}> |
| seq2 | <{a,d}{c}{b}{a,b,e,f}> |
| seq3 | <{d}{a,b,c}{e,f,g}> |
| seq4 | <{a,b}{d,e,f,g,h}> |

This database contains four sequences of itemsets. Each itemset is enclosed in curly braces, and the items within an itemset are separated by commas. For example, the first sequence contains three itemsets: {a}, {c}, and {e}.

Answer to question 2

A: A sequential pattern is a subsequence that appears in several sequences of a database. For example, the sequential pattern <{a}{c}{e}> appears in the first and second sequences of the previous database. This pattern indicates that customers who bought {a} often bought {c} afterward, followed by {e}. The support of a sequential pattern is the number or percentage of sequences that contain it. For example, the support of <{a}{c}{e}> is 2 (or 50%) in the previous database.

Answer to question 3

A: A sequential rule predicts items that will appear after some other items in a sequence. It has the form X -> Y, where X and Y are itemsets. For example, the sequential rule {a} -> {c} means that if a customer buys {a}, they are likely to buy {c} afterward. The confidence of a sequential rule is the conditional probability that Y will follow X in a sequence. For example, the confidence of {a} -> {e} is 100% in the previous database, because every time {a} appears, it is followed by {e}. The support of a sequential rule is the number or percentage of sequences that contain X followed by Y. For example, the support of {a} -> {c} is 2 (or 50%) in the previous database.
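
To make these definitions concrete, here is a small Python sketch (my own illustrative code, using the common convention that all items of X must appear, in any order, before all items of Y) that computes the support and confidence of a rule in the example database:

```python
def contains_rule(sequence, X, Y):
    """True if all items of X appear in the sequence (in any order)
    before all items of Y."""
    seen = set()
    for i, itemset in enumerate(sequence):
        seen |= itemset & X
        if seen == X:
            # every item of Y must appear in the remaining itemsets
            return Y <= set().union(*sequence[i + 1:])
    return False  # X never fully appears

def support_confidence(database, X, Y):
    """Support = number of sequences containing the rule X -> Y;
    confidence = that count divided by the number of sequences containing X."""
    sup_rule = sum(contains_rule(s, X, Y) for s in database)
    sup_x = sum(X <= set().union(*s) for s in database)
    return sup_rule, sup_rule / sup_x

# The example sequence database from the answer to question 1
database = [
    [{"a"}, {"c"}, {"e"}],
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    [{"d"}, {"a", "b", "c"}, {"e", "f", "g"}],
    [{"a", "b"}, {"d", "e", "f", "g", "h"}],
]
print(support_confidence(database, {"a"}, {"c"}))  # (2, 0.5): support 2, confidence 50%
print(support_confidence(database, {"a"}, {"e"}))  # (4, 1.0): support 4, confidence 100%
```

Note that real sequential rule mining algorithms such as RuleGrowth and ERMiner do not test every rule like this; they explore the rule space efficiently, but the measures computed are the same.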

Answer to question 4

D: All of the above. Sequential rule mining also has many other applications, such as malware detection and genome sequence analysis.

Answer to question 5

D: There are many algorithms for sequential rule mining that have been proposed in the literature. Some of them are based on sequential pattern mining algorithms, while others are designed specifically for sequential rule mining. Some examples are:

  • GSP, SPADE, SPAM, PrefixSpan, CM-SPAM, CM-SPADE: These are classic sequential pattern mining algorithms that can be extended to generate sequential rules by computing the confidence of each pattern.
  • RuleGrowth, ERMiner, CMRules: These are sequential rule mining algorithms that directly mine rules without generating patterns first. They use different strategies to identify rules efficiently.

Answer to question 6

A,B: A left expansion of a rule X -> Y adds an item or itemset Z to the left-hand side of the rule. For example, a left expansion of {a} -> {c} could be {a,b} -> {c}. A right expansion of a rule X -> Y adds an item or itemset Z to the right-hand side of a rule. For example, a right expansion of {a} -> {c} could be {a} -> {c, e}. Left and right expansions are used by some sequential rule mining algorithms such as RuleGrowth and ERMiner to grow rules incrementally.

Answer to question 7

A: Sequential rule mining can capture more information about the probability of a pattern being followed by another one, which can be useful for predicting future behavior or events.

Answer to question 8

D: All of the above.

Conclusion

I hope you have enjoyed this little quiz about sequential rule mining. How many answers did you get right? Let me know in the comment section below.



Having a good posture for working at the computer is important!

Today, I am writing a short blog post. I will not talk about data mining; instead, I want to remind all of you, my readers, of the importance of having a good posture when working at the computer. This is especially important if you work for very long periods of time, like I do. Having a good posture and doing some exercise is important for staying healthy, especially as we get older. You may not notice any problems from sitting at a computer 12 hours a day when you are in your 20s, but problems may start to appear in your 30s or 40s. Thus, today I want to talk a little bit about this and show you a picture of how I have arranged my desk for work these days:

As you can guess, I work in a standing position, because I no longer feel comfortable sitting the whole day and I have a lot of work to do. I now alternate between sitting and standing. This is something you may also want to consider if you start to feel tired or have problems with your back or legs.

Another quick recommendation is to avoid working directly on a laptop when possible. Working on a laptop generally results in a bad posture, as the screen is too low. A solution is to connect the laptop to an external screen placed higher, or to use an external keyboard and raise the laptop so that its screen is higher. This is another way to improve your posture while working at the computer. You may also consider getting a better chair or a desk with adjustable height.

Finally, doing some sport is a good way to stay healthy. For example, I like to go running outside. But any sport is good and will improve your health.

This is just a short blog post to remind all my readers about this! 🙂


Test your knowledge about periodic pattern mining

Periodic pattern mining is a data mining technique used to discover periodic patterns in a sequence of events, and it has many applications. I have prepared 10 questions to evaluate your knowledge of periodic pattern mining. Answer them, then check the answers at the end of this post, and let me know in the comment section how many you got right. 😉

If you don’t know about this topic, you can check my introduction to periodic pattern mining and also my list of key papers on periodic pattern mining. And you may also find source code of fast implementations in the SPMF software.

Questions

  1. What is the main goal of periodic pattern mining?
  2. What is the difference between periodic pattern mining and sequential pattern mining?
  3. What are some common applications of periodic pattern mining?
  4. What is the minimum support threshold in periodic pattern mining?
  5. What is the period length in periodic pattern mining?
  6. What is the PFPM algorithm in periodic pattern mining?
  7. What is a stable periodic pattern?
  8. What is the difference between exact and approximate periodic patterns?
  9. What is the difference between global and local periodic patterns?
  10. What are some challenges in periodic pattern mining?

Answers

  1. The main goal of periodic pattern mining is to discover periodic patterns, that is, events that repeat more or less regularly over time in a sequence of events.
  2. Sequential pattern mining focuses on finding subsequences that are common to multiple sequences, while periodic pattern mining focuses on finding repeating patterns within a single sequence of events.
  3. Some common applications of periodic pattern mining include stock market analysis, weather forecasting, customer behavior analysis, and bioinformatics.
  4. The minimum support threshold is a user-defined parameter that specifies the minimum number of occurrences that a pattern must have in a sequence.
  5. The period length is the length of time between repeating occurrences of a pattern in the input sequence.
  6. The PFPM algorithm is an algorithm for discovering periodic frequent itemsets.
  7. A stable periodic pattern is a pattern that occurs at regular intervals with little variability in a sequence of events. This is the opposite of an unstable pattern. Several algorithms exist for this, such as TSPIN and LPP-Growth.
  8. Exact periodic patterns have a fixed period length and occur at regular intervals, while approximate periodic patterns have some variability in their period length and occurrence.
  9. Global periodic patterns occur throughout the entire input sequence, while local periodic patterns are patterns that have a periodic behavior only within some specific time intervals.
  10. Some challenges in periodic pattern mining include handling large datasets, dealing with noise and missing data, and incorporating constraints.
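
To illustrate answers 4, 5 and 7, here is a toy Python sketch that computes the periods of a pattern from the list of time points where it occurs. This is just my own illustration, using one common convention for the boundary periods; exact definitions vary between papers:

```python
def periods(occurrences, sequence_length):
    """The periods of a pattern: gaps between consecutive occurrences,
    including a boundary gap at the start and at the end of the sequence."""
    points = [-1] + sorted(occurrences) + [sequence_length]
    return [b - a for a, b in zip(points, points[1:])]

# A pattern occurring at time points 2, 5, 8 and 11 in a sequence of length 12:
print(periods([2, 5, 8, 11], 12))  # [3, 3, 3, 3, 1]
# Its maximum periodicity is 3: it always reappears within 3 time points,
# so it would be a stable/regular pattern for a maximum-period threshold >= 3.
```

A periodic pattern mining algorithm would compute such period lists for candidate patterns and keep those whose periods satisfy the user's constraints (e.g. a maximum period, or low variance of the periods for stable patterns).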

Did you manage to answer all the questions? Let me know how you did in the comment section 😉



An Interactive Demo of The Apriori algorithm

I have created a new webpage for students that provides an interactive demo of the Apriori algorithm. It allows you to run Apriori in your browser and see the results step by step.

The website is here: Apriori Algorithm Demo

To use it, you first have to input some data and choose a minimum support value, then click the Run Apriori button:

Then, all the steps of applying the Apriori algorithm are displayed:

Step 1: Calculate the support of single items

  • {apple} (support: 3)
  • {orange} (support: 5)
  • {milk} (support: 1)
  • {tomato} (support: 3)
  • {bread} (support: 4)

Step 2: Keep only the frequent items

{milk} was pruned because its support count (1) is less than the minimum support count (2)

  • {apple} (support: 3)
  • {orange} (support: 5)
  • {tomato} (support: 3)
  • {bread} (support: 4)

Step 3: Join frequent itemsets to create candidates of size 2

- {apple} and {orange} are joined to obtain {apple, orange}

- {apple} and {tomato} are joined to obtain {apple, tomato}

- {apple} and {bread} are joined to obtain {apple, bread}

- {orange} and {tomato} are joined to obtain {orange, tomato}

- {orange} and {bread} are joined to obtain {bread, orange}

- {tomato} and {bread} are joined to obtain {bread, tomato}

Step 4: Calculate the support of candidate itemsets

  • {apple, orange} (support: 3)
  • {apple, tomato} (support: 2)
  • {apple, bread} (support: 2)
  • {orange, tomato} (support: 3)
  • {bread, orange} (support: 4)
  • {bread, tomato} (support: 3)

Step 5: Keep only the candidate itemsets that are frequent

  • {apple, orange} (support: 3)
  • {apple, tomato} (support: 2)
  • {apple, bread} (support: 2)
  • {orange, tomato} (support: 3)
  • {bread, orange} (support: 4)
  • {bread, tomato} (support: 3)

Step 6: Join frequent itemsets to create candidates of size 3

- {apple, orange} and {apple, tomato} are joined to obtain {apple, orange, tomato}

- {apple, orange} and {apple, bread} are joined to obtain {apple, bread, orange}

- {apple, tomato} and {apple, bread} are joined to obtain {apple, bread, tomato}

- {bread, orange} and {bread, tomato} are joined to obtain {bread, orange, tomato}

Step 7: Calculate the support of candidate itemsets

  • {apple, orange, tomato} (support: 2)
  • {apple, bread, orange} (support: 2)
  • {apple, bread, tomato} (support: 2)
  • {bread, orange, tomato} (support: 3)

Step 8: Keep only the candidate itemsets that are frequent

  • {apple, orange, tomato} (support: 2)
  • {apple, bread, orange} (support: 2)
  • {apple, bread, tomato} (support: 2)
  • {bread, orange, tomato} (support: 3)

Step 9: Join frequent itemsets to create candidates of size 4

- {apple, bread, orange} and {apple, bread, tomato} are joined to obtain {apple, bread, orange, tomato}

Step 10: Calculate the support of candidate itemsets

  • {apple, bread, orange, tomato} (support: 2)

Step 11: Keep only the candidate itemsets that are frequent

  • {apple, bread, orange, tomato} (support: 2)

Step 12: Join frequent itemsets to create candidates of size 5

No more candidates can be generated. Total number of frequent itemsets found: 15

As well as the final result (the frequent itemsets):


That is all. With this tool, you can run the algorithm in your browser and directly see the results, which is useful for students.
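
To summarize, the level-wise procedure shown above can be sketched in a few lines of Python. This is my own illustrative sketch: it omits Apriori's subset-based candidate pruning for brevity, and the small database below is one I made up to be consistent with the supports shown in the steps, since the demo's actual input is not reproduced here:

```python
def apriori(transactions, minsup):
    """Level-wise frequent itemset mining (the Apriori idea),
    without the subset-based pruning step, for clarity."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Count the transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions)

    # Steps 1-2: frequent single items
    singles = {frozenset([item]) for t in transactions for item in t}
    level = [s for s in singles if support(s) >= minsup]
    frequent = {s: support(s) for s in level}
    k = 2
    while level:
        # Join step: unite frequent (k-1)-itemsets that differ by one item
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only the frequent candidates, then move to the next level
        level = [c for c in candidates if support(c) >= minsup]
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent

# A made-up database consistent with the supports shown in the steps above
database = [
    {"apple", "bread", "orange", "tomato"},
    {"apple", "bread", "orange", "tomato"},
    {"apple", "milk", "orange"},
    {"bread", "orange", "tomato"},
    {"bread", "orange"},
]
result = apriori(database, 2)  # finds 15 frequent itemsets
```

On this database with a minimum support of 2, the sketch finds the same 15 frequent itemsets as the demo, and {milk} is pruned at level 1.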

If you want to learn more about the Apriori algorithm, you can check my blog post that explains the Apriori algorithm and my video lecture about Apriori. And if you want to know more about pattern mining, please check my free online pattern mining course. Finally, if you want an efficient implementation of Apriori, please check the SPMF software, which offers highly efficient implementations of Apriori and hundreds of other pattern mining algorithms.

Also, if you are new to itemset mining, you might be interested in these two survey papers, which give a good introduction to the topic:

  • Fournier-Viger, P., Lin, J. C.-W., Vo, B., Chi, T. T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207. DOI: 10.1002/widm.1207
  • Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: a 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329. DOI: 10.1002/widm.1329

