In this blog post, I will give an introduction to **sequential pattern mining, **an important data mining task with a wide range of applications from text analysis to market basket analysis. This blog post is aimed to be a short introductino. If you want to read a more detailed introduction to **sequential pattern mining**, you can read a survey paper that I recently wrote on this topic.

**What is sequential pattern mining?**

Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. **Pattern mining** consists of discovering interesting, useful, and unexpected patterns in databases Various types of patterns can be discovered in databases such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of **sequential pattern mining** is a data mining task specialized for analyzing **sequential data, **to discover** sequential patterns**. More precisely, it consists of discovering interesting subsequences in **a set of sequences**, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as **sequences of symbols** in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

I will now explain the task of** sequential pattern mining** with an example. Consider the following** sequence database**, representing the purchases made by customers in a retail store.

This database contains four sequences. Each **sequence** represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items *a* and *b* together, then purchased an item *c*, then purchased items *f* and *g *together, then purchased an item *g*, and then finally purchased an item *e. *

**Traditionally, sequential pattern mining **is being used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the **frequent** **sequential patterns**. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers to take marketing decisions.

To do **sequential pattern mining**, a user must provide a sequence database and specify a parameter called the **minimum support threshold**. This parameter indicates a minimum number of sequences in which a pattern must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of** sequential pattern mining** consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, many subsequences met this requirement. Some of these sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the *support*) is indicated in the right column of the table**. **

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appears in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied on database containing hundreds of thousands of sequences.

Another example of application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

**Can sequential pattern mining be applied to time series?**

Besides sequences, **sequential pattern mining** can also be applied to** time series** (e.g. stock data), when discretization is performed as a pre-processing step. For example, the figure below shows a **time series** (an ordered list of numbers) on the left. On the right, a **sequence** (a sequence of symbols) is shown representing the same data, after applying a transformation. Various transformations can be done to transform a time series to a sequence such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.

**Where can I get Sequential pattern mining implementations?**

To try sequential pattern mining with your datasets, you may try the open-source** SPMF data mining software**, which provides implementations of numerous** sequential pattern mining algorithms: **http://www.philippe-fournier-viger.com/spmf/

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering **maximal sequential patterns**, **closed sequential patterns** and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

**What are the current best algorithms for sequential pattern mining?**

There exists several sequential pattern mining algorithms. Some of the classic algorithms for this problem are **PrefixSpan, Spade, SPAM,** and **GSP**. However, in the recent decade, several novel and more efficient algorithms have been proposed such as **CM-SPADE** and **CM-SPAM** (2014), **FCloSM **and** FGenSM** (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

**Conclusion**

In this blog post, I have given a brief overview of **sequential pattern mining**, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms forf sequential pattern mining, extensions, research challenges and opportunities.

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). **A Survey of Sequential Pattern Mining**. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.

—

**Philippe Fournier-Viger** is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

What is the difference between sequential pattern and periodic pattern?

Hello, the basic difference is the following:

Periodic patterns are discovered in a single sequence. We want to find a pattern that periodically appears. For example, it must appear approximately every week.

Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is common to several sequences. If a pattern appears multiple times in the SAME sequence, it will still be counted as just once for sequential pattern mining algorithms.

I am working on an interesting problem now. I’d like to pattern mine stack traces like the following and teach a parser how to process them.

“Error: Something unexpected has occurred.

at main (c:\Users\Me\Documents\MyApp\app.js:9:15)

at Object. (c:\Users\Me\Documents\MyApp\app.js:17:1)

at Module._compile (module.js:460:26)”

The interesting part, that according to your definition the “\nat * (*:*:*)” is both a sequential and a periodic pattern, because it is present in many different sequences and in a single sequence multiple times. Is there a way to distinguish it from the “^*Error: *\n**$” pattern, which is present only once per sequence beside that it is more frequent? Are these periodic pattern somehow special? Any thoughts, suggestions, recommended algorithms?

Hello,

Yes, you can see a stack trace as a sequence perhaps. Then the type of error could be seen as a “class” that annotates the sequence. For example, if you use my SPMF software, you can discover some

class sequential rulesusing the TopSeqClassRules. That means a type of sequential pattern that is specific to some class such as Error:Somethime unexpected happened” and that has the form of a rule such as X –> Y. I think you could check various algorithms and see what makes the most sense for your problems. Best,Okay, thanks!

Got invalid argument exception by your app. Most of the algorithms appear to require numbers. Does any of them work on strings? Or should I just modify the code?

Internally, all the algorithms use a number based representations of items/events because it is much more efficient when searching for patterns than using strings, in terms of space and memory. For example, if you have two strings “abcdefghij” and “abcdefghik”, then you can represent them as strings or as integers. If internally, we represent as strings, then we need memory to store about 10 characters, while an integer is only 32 bits or 64 bits. Besides, comparing two integers is much faster than comparing two strings. If you compare two integers, you just use maybe one instruction of the CPU (>= < >) while comparing two strings, requires to compare each character one by one. This is the reason why these algorithms uses numbers as internal representation.

Having said that, this is also the reason why the default input format for most algorithms in SPMF is based on integers. But you can also use String. A simple way is to convert your input file from strings to integers, apply the algorithm, and then convert the integers to strings in the result. I have actually, implemented something like that in SPMF. If you are using the user interface of SPMF, you can use string as input for some algorithms like CM-SPAM. Here is a description of how to do that from the documentation of SPMF for that algorithm:

If you are not using the graphical interface of SPMF, but want to use strings, you can check example, this example of the documentation:

http://www.philippe-fournier-viger.com/spmf/Using_a_TEXT_file_as_input.php

In my case it does not help much, but I could use ASCII codes to convert a trace to a seq of numbers. Most of the traces contain ASCII characters only I think, so for testing purposes this will be just fine. As far as I can tell the rules of your input format is the following: 1 sequence in 1 line and the items are separated by space. I guess there is no way to use itemsets instead of items, but in my case itemsets are not needed.

Yes, the format of the input file for the algorithms is explained in the documentation on the webpage for each algorithm. For example, to mine sequential patterns using a normal algorithms like CM-SPAM or sequential rules with RuleGrowth, the format is:

There is an online ascii converter, so there is no need to write a single line of code:

https://www.browserling.com/tools/text-to-ascii

As far as I understand these pattern mining algorithms find subsequences or substrings. I’d like to convert my patterns to regex patterns, so parsing would be a lot easier. For example if I have the following sequences: “abbc”, “abxc”,”aybc”,”abzc”, then the “abc” will be the common subsequence with 4/4 support. But if I want to parse the sequences with “abc”, then I will get the following results: “ab(b)c”,”a(b)bc”,”ab(x)c”,”a(y)bc”,”ab(z)c”. As you can see the content of the wildcard can be either before or after the “b” item. So if I want to translate this to regex I would have 2 patterns: /ab(.*)c/”, “a(.*)bc” or we can merge them into one: /a(.*)b(.*)c/. If I expect a single capturing group always on the same place, then I have to choose between the first two regex patterns. Is there any algorithm which interprets patterns like this (so checks where the gap position is by matching), or do I have to invent some kind of converter between the subsequence patterns and the regex patterns? I could write a parser based on subsequences too, but I assume the C based regex of the actual js engine is a lot faster than writing something in pure javascript.

I started reading your survey paper ( A Survey of Sequential Pattern Mining). I had a query. You stated that an item set X is a subset of items I.

Thus an item, say A, can appear at most ‘k’ number of times in an itemset if it appears ‘k’ number of items in set I.

However, while explaining BFS approach you took I = {a,b,c} and while enumerating 2-sequences you generated as well.

I guess there is an error in this enumeration.

Also while explaining definition of support of subsequence there is a mistake in its formulated definition.

Given is:

sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |

I guess it should be:

sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |

Please clarify this query.

Thanks.

Hello, Thanks for your feedback.

Yes, there was an error in the definition of sup(sa), as you noticed. I have fixed it and re-uploaded the paper to my website:

http://www.philippe-fournier-viger.com/dspr-paper5.pdf

Thanks for reporting it.

For the other question, I think that something is missing in your question. Do you want to ask why the same item can appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple in a same sequence. Let me explain this more clearly. In sequential pattern mining, an item can only appear at most one time in each itemset. But a sequence can contain multiple itemsets. Thus, it is possible that the same item appear multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For example, the sequence (a)(a) is correct because it contains two times “a” but not in the same itemset. But the sequence (a a) would be incorrect according to the definition because it contains many times the same item in the same itemset. So, to explain again, an itemset X in a sequence must be a subset of I. Thus, an item cannot appear more than once in an itemset. However, a sequence contains many itemsets. Thus, the same item is allowed to appear multiple time in the same sequence, but it must be in different itemsets.

Hope this helps! And thanks for reporting the error.

Best regards,

Yes. That was my doubt. X should be a subset of I.

If I = {a,a,b,c} then itemset {a,a} is valid.

However if I = {a,b,c} then {a,a} is invalid.

I got confused because while explaining BFS based algorithms on Page 58 there is an example that takes I = {a,b,c}. And while enumerating 2-sequences contains itemsets like {a,a}, {b,b}, {c,c}.

I wanted to point out the above fact. As I = {a,b,c}. The sequence is valid. However the sequence is not.

EDIT in the above reply:

I wanted to point out the above fact. As I = {a,b,c}. The sequence ({a}, {a}) is valid. However the sequence ({a,a}) is not.

Hello,

I understand now. Yes, you are right, there is some error in that example. I have fixed and re-uploaded the paper to my website. Thanks!

Philippe

Hello,

I have a question about the example in this article. Is correct that the pattern has a support of 2? Because for what I understood this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there some errors in the example or I am wrong in the comprehension of Sequential Pattern?

Thanks.

Hello, I would like to answer your question but I am not sure about which patterns you are talking? I think that when you submitted your question, WordPress might have removed the patterns from your comment because of the > symbols.

If you can tell me which pattern, I will answer your question.

Yes, it probably thinks that they are a tags. My bad. The patters are ({b},{f,g}) and also ({f,g}) alone.

Thanks.

Hi there is nothing wrong with that. Here is the point, you should look for if the dataset (the first sequence) involves the subsequence respectively. To be more clear, b is a subsequence of (a,b) in the sequence so it satisfies the rule then look for the (f,g) I believe your trouble is with b so I will not talk about the other items.

Yes, now it is correct. At the time there was an error. The two sequential patterns that I mentioned had a support of 2 in the old image, as reported in a comment below.

Hello,

I’m curios about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. Because I’m studying on this research area in order to adopt this concept to find sequential high yield pattern from manufacturing processes. Thus, I’m trying to search several articles including ones written by yours.

While searching, I’m confused about the difference of the above two terms.

In my opinion, the main difference is ‘sequential’. Since the result of some algorithms related to ‘high utility itemset’ do not consider the order of transactions, right?

If I were right, I will able to narrow down my searching scope not ‘high utility itemset’ but ‘high utility sequential pattern’.

Thank you, Sir.

Hi,

Yes, that is correct. In

high utility itemset mining, the time or sequential order is not considered.In high utility itemset mining, the input is a transaction database ( a set of transactions performed by many customers).

The goal is then to find the group of items that yield a high profit when purchased together. But there is no order.

For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple and bread makes a lot of money. But it does not tell anything about the order.

In high utility

sequential pattern mining, the order is considered.Here the input is a sequence database. It means a set of sequences of transactions.

For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions that is sequentiall ordered.

Then you can find some patterns such as <(apple),(bread)> which means that buying apple followed by buying bread yield a high profit.

But a problem of high utility sequential pattern mining is that there is no concept of confidence.

If you have the pattern <(apple),(bread)> , you still don’t know likely it is that if someone buy Apple, he will then buy bread.

To solve this problem,

high utility sequential rule miningwas proposed.It is also applied on a sequence database, and will find rules such as:

Apple –> Bread confidence : 70 % which means that this rules yield a lot of money and when someone buy Apple, 70 % of the time he will buy bread after.

Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode mining is similar to sequential pattern mining except that patterns are mine from a single sequence instead of many sequences.

Thanks for your reply.

This comment is quite helpful to understand those concepts.

I’ve already read an article considering “USpan Algorithm”.

I think high utility sequential pattern mining is appropriate to my research problem.

Can you recommend some article about high utility sequential pattern mining except for “USpan Algorithm”?

I’m trying to find a kind of ‘Top K’ mining algorithm for high utility sequential pattern mining, but that’s not easy.

Btw, Thank you again for your kind reply.

Hello, You are welcome.

Yes, USpan is a good algorithm. The authors of that algorithm wrote a paper about a top-k algorithm for high utility sequential pattern mining, published at ICDM 2013:

https://www-staff.it.uts.edu.au/~lbcao/publication/icdm13-yin.pdf

The code of USpan is available in the SPMF data mining library as open-sourcehttp://www.philippe-fournier-viger.com/spmf/.

If you want some faster high utility sequentila pattern mining algorithms (faster than USpan), you can check CRoM and HuspExt.

Best,

Philippe

I read your paper A Survey of Sequential Pattern Mining, i dont understand the difference between closed,maximal, and generator. Could you please give me a simple review and when to used that.

Another question, i have read paper said that contiguous sequential pattern mining is better than sequential pattern mining. Could you please give me a hint, why?

I have search many papers but cant find the answers.

Thank you.

Hi,

I was wondering if you could explain briefly when a sequential pattern mining method (algorithm) is robust. How to validate the robustness of an algorithm?

Best regards,

Cristina

Hello,

In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input. How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern mining algorithm to work well with different kind of data, and to show some errors if the input data or parameter value(s) are not what the algorithm should expect. So this is software engineering 101. When you design a program, you should test it with different kind of data and parameters to make sure that it works in all situations, and make sure that it shows error message to the user when an error message should be shown.

Regards,

Thank you

what is the difference between frequent pattern mining and sequential pattern mining?

In simple words, in frequent itemset mining, we want to find some patterns (sets of items) that appear in many transactions. But we don’t care about the time or order between these items. For example, we could find that many customers buy bread with apple. But there is no order between bread and apple. We just know that many people buy it together

In sequential pattern mining, we also consider the order or time. Instead of finding patterns in transactions, we find patterns in some sequences. Each sequence is an ordered list of transactions. For example, we can have some sequence such that today i buy bread, tomorrow i buy milk, and after tomorrow i buy bread and milk. Thus, we have many sequences like that and we want to find the subsequences that appear in many sequences. For example, we could find the sequential pattern that many people buy milk and then after that buy bread, and then later buy coffee. Here there is some order.

In the case where all sequences only contain one transaction, then sequential pattern mining is the same as frequent itemset mining. Thus sequential pattern mining is more general.

I tried to give you the main idea in a simple way. If you want some more formal definition you could look at my survey papers on these topics which are easy to understand, and read the first few pages:

Fournier-Viger, P., Lin, J. C.-W., Vo, B, Chi, T.T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, e1207 doi: 10.1002/widm.1207, 18 pages. ISI, EI, SCI, Q1

ISSN: 1942-4787

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.

Hi Philippe,

Thank you very much for this useful introduction. I have the following questions:

1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset should be a sequential pattern since it appears in SID1 and SID2.

2. I think the support of should be 3 since it appears in SID1, SID3, and SID4 and the support of should be 3 since it appears in SID1, SID3, and SID4.

I am now reading your survey paper on frequent sequential pattern mining to gain more knowledge on this field.

Happy New Year.

Dang Nguyen

Since some errors happened, I re-write my questions:

1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset ({b, a}) should be a sequential pattern since it appears in SID1 and SID2.

2. I think the support of ({b}, {f, g}) should be 3 since it appears in SID1, SID3, and SID4 and the support of ({f, g}) should be 3 since it appears in SID1, SID3, and SID4.

Hi Dang,

Thanks for your comments and happy new year.

1. Yes, ({a,b}) has a support of 2 in the above table. As you said, there is no order in the itemsets. Thus ({a,b}) is the same as ({b,a)}.

2. Yes, that is an error in the example! I have updated the example so that ({b}, {f, g}) and ( {f, g}) have a support of 3. I will update the survey paper, which contains the same error probably tonight. Thanks for reporting this issue.

Best,

Philippe

Is it possible to use contiguous pattern mining for mining breast cancer data. I’m trying to mine the particular gene BRCA1 or BRCA2 to find the interesting patterns in it. Research direction is correct or not? Kindly give your valuable suggestios

Hi,

Thanks for reading. You could try to use that. But is it a good idea? Since I know nothing about cancer data and genes, it cannot give advice about that.

Best,

Thank u for your reply.

Mining normal and cancer data using various parameters and comparing them. This approach works?

It depends on what is your goal and what is the result. You can try it. If you can find some interesting patterns that are useful to understand cancer, then it will be interesting. Ideally you could find some patterns that you would validate with a domain expert (e.g. doctor) to show that these patterns are interesting.

Otherwise, you can work on a more technical level and work on improving the algorithms to add more features that are needed for this type of data.

Thanks a lot

First of all thanks for this amazing article.

I recently learned about SPMF and wanted to use it on my data, unfortunately I didn’t succeed to apply it properly.

My data is a time series and look like this (DATE TIME , Name):

2009-03-12 14:15:17 , James

2009-03-12 14:15:19 , James

2009-03-12 14:15:07 , James

2009-03-12 14:17:11 , Mike

2009-03-12 14:25:15 , Julie

2009-03-13 14:32:39 , Thomas

2009-03-13 14:48:46 , Mike

2009-03-13 14:51:23 , Madona

2009-03-13 14:59:47 , Julie

And so on.

I am searching for any possible pattern in this data.

More precisely I want to know “When Julie sends a message?”.

And if there is any relation between Julie’s and others’ behavior.

There are also the assumptions below:

1. I do not know if there is any relation or not.

2. Always there might be new people joining the chat room.

3. There are several “usual guests” who are always taking part in conversations and are very talkative.

4. We do not know anything about these people (and their behavior). They are just a “name” for us.

Now my question is: Is there any possibility to use SPMF to extract patterns from this time series, which could potentially be used to predict Julie’s behavior according to the others’ behavior?

Thank you again for this wonderful blog.

Hi, Glad you like the blog post and the blog 😉

Yes, this is a time-series but if you ignore the timestamps, you can also see it as a sequence and apply sequential pattern mining algorithm. The key is that if you can convert your data to the same format as some existing problem, then you can use the existing algorithms.

In your case, you could view each day of the chat room as a sequence: a sequence for 2018-01-01, a sequence for 2018-01-02, a sequence for 2018-01-03. Then each sequence would be like (mike_join, mike_post_message, julie_join, julie_post, mike_quit, julie_quit…) Thus, you would have a sequence database with a sequence for each day, and then you could apply sequential pattern mining or sequential rule mining to find patterns common to severa days… Or maybe that a day is too long. Actually, if you use days, then the sequences will be very long and the algorithm may not work well, so perhaps that dividing by hours may work better.

But I am not sure that pattern mining is the best technique for this problem.

But if you convert your data to sequences, you could also try to the sequence prediction algorithm that can be trained with sequence to predice what will happen next for a new sequence.

This is just some idea. Actually, your problem may be a bit more complex, especially if you also want to consider timestamps!

Best,

Philippe

Thanks for the detailed reply.

I will give the “sequential pattern mining” a try and see if it provides me useful results. However, I still need to know learn what to do with the noises. For example when the internet connection of some users once in a while disconnects and users logout/login randomly.

For the learning purpose, I still would like to know if there is any possibility to detect temporal patterns in this type of time-series.

For example, can we somehow come up with the below (imaginary) model:

IF (James AND [+2 minutes] Mike) THEN ([+8 minutes] Julie).

IF (Thomas AND [+16 minutes] Mike AND [+3 minutes] Madona) THEN ([+8 minutes] Julie).

Is there any technique which could be useful in such scenarios? Also in presence of noises?

Hello,

Glad my reply is useful. In my opinion, if you consider the time, it becomes much more complicated because you then need to decide whether (James AND [+60 seconds] Mike) is the same as (James AND [+61 seconds] Mike) or (James AND [+65 seconds] Mike). If you remove the time, then they become the same pattern. But if you consider time… then either all of these patterns are considered as different (but it is probably not what you want…) or you need to find a way to define what is similar such as 60 with more or less 5 minutes is considered as similar… Most paper that I have read in sequential pattern mining cannot do that. But it does not mean that it cannot be done. You could always design a new algorithm and you add some news of handling the time with more flexibility and write a paper about it.

An example of paper that consider the time in sequential pattern mining is:

http://www.philippe-fournier-viger.com/spmf/hirateyamana.pdf

However, for that paper, A +10 seconds B will be seen as different from A +11 seconds B for example…

Best,

Thank you very very much for your answers.

Now I have a better picture in mind and will follow your suggestions.

Bests 🙂

Hi Philippe,

I have been reading your survey paper that you attached in the blog post. In the page 60 of the paper, I’ve read something like ” For example, the item-positions

of the pattern hfagi in the sequence s2 of Table 1 is f(s2; 1); (s2; 4)g, indicating that the pattern hfagi

appears in the first and fourth itemsts of sequence s2. “. And follow the logic of the previous sentence in the paper, I would think the statement would be ” For example, the item-positions

of the pattern hfagi in the sequence s2 of Table 1 is f(s1; 1); (s1; 4)g, indicating that the pattern hfagi

appears in the first and fourth itemsts of sequence s2. “. Hope you can consider them and show me if I’m wrong or not. Thanks.

Best,

Nam

Hi Nam,

Thanks for reading the blog and the survey, and for your feedback! I have checked it and there is indeed an error in the definition.

It should be:

The item-positions of sa in sb, denoted as ip(sa, sb), is the set of pairs of the form.(sb, ik),Then, after that the example is correct. Thanks for letting me know. I have fixed the PDF on the website.

Best regards,

Philippe

Hi Philippe,

I have a question regarding the impact of sequence ( {a,b},{c}) differs from ({a},{b},{c}) in final sequence mining.

Hello,

Yes, these two sequences are considered different.

The sequence {a,b},{c}) means that “a” and “b” appeared at the SAME TIME, and were then followed by “c”.

The sequence ({a},{b},{c}) means that “a” was followed by “b”, and that it was then followed by “c”.

So those two patterns represent two different cases.

Hope that this is clear. And thanks for reading the blog 😉

Best,

Philippe

Pingback: Report about the DEXA 2018 and DAWAK 2018 conferences - The Data Mining BlogThe Data Mining Blog

Hi,

I am trying to do an employee attrition analysis on sample kaggle data https://www.kaggle.com/HRAnalyticRepository/employee-attrition-data/home

I am trying to identify exactly which individuals might leave based on patterns that might be present in existing data.I think frequent pattern does not make sense. Does sequence makes sense? I want to understand which kind of pattern analysis would fit here in this case. .

Hi, Thanks for reading the blog and interesting topic. However, I don’t have time to look at your data currently. But from your description, if you have sequence of events, sequential mining can perhaps make sense. But I think it requires to check in more details the data, which I cannot do at this moment.

Hello

I don’t understand this part

1 ≤ i1 < i2 < … < in ≤ m such that A1 ⊆ Bi1, A2 ⊆ Bi2 … An ⊆ Bin.

In this example is contained in

if A1 is {b}, then which one is Bi1?

Hi, This is about the definition of subsequences. It is the formal definition so I understand that it may not be so easy to understand.

The idea is the following. Let’s say that you have two sequences:

sa = A1, A2, …, An

sb = B1, B2, …, Bm

We want to define what it means for a sequence SA to be contained in a sequence sB.

We say that there must exists some numbers i1, i2, in such that 1 <= i1 < i2 ... < in < m. In other words, these numbers must correspond to positions in the sequences sB (recall that the sequence sB contains m itemsets). Then, we say that sA is contained in sB if A1, A2 ... An appears at these positions. Formally: A1 ⊆ Bi1, A2 ⊆ Bi2, ..., An ⊆ Bin In English that means that all the itemsets of sA are contained in some itemsets of sB and that the sequential order between the itemsets are the same. Hope this helps.

Hi!

Dynamic specification mining is a technique to infer specifications (in form of an automaton, LTL formulas or similar properties) from execution traces.

This seems very similar to sequential pattern mining.

Is there a difference between (dynamic) specification mining and sequential pattern mining?

Thank you very much!

Martin

Dear Martin,

This an interesting observation. There many types of patterns that can be extracted from temporal data such partial orders, sequential patterns, sequential rules, LTL formulas, automatas, etc. All of these patterns are extracted with the same purpose, which is to summarize the data or extract some important information from the data. The difference is mainly the definition of these pattern types. And then, because the definitions are different, the algorithm will also be different. Here is a few papers that have used my SPMF software for pattern mining in software engineering:

Ramdhani, M. A., Maylawati, D. S. A., Amin, A. S., & Aulawi, H. (2018). Requirements Elicitation in Software Engineering. International Journal of Engineering & Technology (UEA), 7(2.19), 772-775.

Heumüller, R., Nielebock, S., & Ortmeier, F. (2018, September). Who plays with whom?… and how? mining API interaction patterns from source code. In Proceedings of the 7th International Workshop on Software Mining (pp. 8-11). ACM.

Schulze, C., Cleaveland, R., & Lindvall, M. (2018, June). Automated Specification Extraction and Analysis with Specstractor. In International Conference on Software Engineering and Formal Methods (pp. 37-53). Springer, Cham.

Zaman, Tarannum Shaila, and Tingting Yu. “Extracting implicit programming rules: comparing static and dynamic approaches.” Proceedings of the 7th International Workshop on Software Mining. ACM, 2018.

Ramadani, J. (2017). Mining software repositories for coupled changes.

Udagawa, Y., 2016, November. Maximal frequent sequence mining for finding software clones. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services (pp. 26-33). ACM.

Hello Dr. Philippe,

Thanks for this informative blog.

I have recently started working in the area of “Sequential Pattern Mining in Collaborative Filtering Based E-Commerce Recommendation Systems”.

To initiate with, I started studying your paper on “A survey on Sequential Pattern Mining” which provided me very useful details. Also, I studied about different types of recommendation systems such as Content Based,Collaborative Filtering (CF) and Hybrid etc.

I am thinking to work in the direction of :

1. How to improve the prediction accuracy of unknown ratings in user-item

rating matrix for Collaborative Filtering based recommendation models?

2. How to utilize Sequential Pattern Mining in above (1) to improve the

recommendations ? e.g by:

i) Finding the sequence of similarities/relationship

between items purchased for better recommendation

ii) Finding the time gap between different purchase

sequences to know if the items in a sequence were

purchased long time ago, they may not be suitable

for recommendation

iii) Finding the weight of each sequence or items in the

sequence which can help in prioritizing purchase

sequences

If you can share your opinion on this,I will be grateful.

Hi, It seems like an interesting topic and thanks for reading the survey paper on sequential pattern mining 😉

About e-commerce recommendation, I did not read much on this topic. But I know that some researchers have used sequential patterns for other types of recommendation. For example, a while ago I read some people using sequential patterns for studying the places that tourists will visit in a city. In that case, a sequence is a sequence of GPS location, and there is the problem of Point-of-interest recommendation. This is obviously a different problem from e-commerce recommendation but I think that you can have a look at such problem. Maybe it can give you some inspiration for your problem. I think that it is good that you are not just considering sequential pattern mining but that you also want to add additional information such as weights, etc. If you make the problem more rich by adding this additional info, your approach will be more original and may have better performance perhaps.

By the way, if you want to take the user profile into account, you can also check multi-dimensional sequential pattern mining (SeqDIM, DIMSeq). Those algorithms are implemented in my SPMF software. You can also try them. Regards.

Thank you Dr. Philippe for your detailed response and providing perspective of multidimensional sequential pattern mining. I will look into these as well.

Thanks & Regards

Hi Dr. Philippe,

What is the meaning of contiguous in SPM?

Hello, It means that all itemsets in a sequential pattern must appear consecutively in a sequence (there cannot be a gap, that is we cannot skip some itemsets to make a sequential pattern).

thanks for ur reply

your forum is main reason for my proceedings in my research. It has given an indepth knowledge about SPM. Thanks a lot for the support

Thanks. Glad it is useful! And thanks for posting.

Hello Philippe,

The sequence seem to be subsequence of SID 1 and SID 3. But why is it not listed in the table?

Sorry for my typo, I mean the sequence ({a}, {f, g})

Hi, yes you are right. ({a}, {f, g}) should also be in the table. I will update the blog post. Thanks!!

为什么没有增量序列模式挖掘，从哪能找到？

Hi, there are not many papers on incremental sequential pattern mining. Here are some papers:

Jerry Chun-Wei Lin, Tzung-Pei Hong, Wensheng Gan, Hsin-Yi Chen, Sheng-Tun Li:

Incrementally updating the discovered sequential patterns based on pre-large concept. Intell. Data Anal. 19(5): 1071-1089 (2015)

Chun-Wei Lin, Wensheng Gan, Tzung-Pei Hong, Raylin Tso:

An Incremental Algorithm for Maintaining the Built FUSP Trees Based on the Pre-large Concepts. ECC (1) 2014: 135-144

Hong, T.P., Wang, C.Y., Tseng, S.S.: An incremental mining algorithm for maintaining sequential patterns using pre-large sequences. Expert Systems with Applications 38, 7051–7058 (2011)

[c15] Chun-Wei Lin, Tzung-Pei Hong, Hong-Yu Lee, Shyue-Liang Wang:

Maintenance of Pre-large FUSP Trees in Dynamic Databases. IBICA 2011: 199-202

Chun-Wei Lin, Tzung-Pei Hong, Wen-Hsiang Lu, Hsin-Yi Chen:

An FUSP-Tree Maintenance Algorithm for Record Modification. ICDM Workshops 2008: 649-653

Chun-Wei Lin, Tzung-Pei Hong, Wen-Hsiang Lu, Wen-Yang Lin:

An Incremental FUSP-Tree Maintenance Algorithm. ISDA (1) 2008: 445-449

Lin, M.Y., Lee, S.Y.: Incremental update on sequential patterns in large databases. In: IEEE International Conference on Tools with Artificial Intelligence, pp. 24–31 (1998)

There might also be others…

thank you

你好，我在SPMF上看到这个算法“Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM or PrefixSpan”的介绍，比如“Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM”，我看了一下这个算法的实现，里面直接调用了ClaSP算法。这样一来，和ClaSP和Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM or PrefixSpan有什么区别？在者，你能否讲述一下Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM or PrefixSpan的流程。

以Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM 为例说明一下即可。非常感谢。

你好！I will explain a bit. A closed sequential pattern is a sequential pattern that is not included in another sequential pattern having the same support. Thus, closed sequential patterns are a subset of the set of sequential patterns.

In the example, “Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM”, the software will first apply the SPAM algorithm to find the frequent sequential patterns. Then, it will compare the sequential patterns with each other to remove the patterns that are NOT closed. The remaining patterns are the closed sequential patterns! In other words, the goal of this example is to obtain the closed sequential patterns, and to obtain these patterns, we first find all sequential patterns, and then compare them with each other to find the closed patterns.

I should also say that this may not be the most efficient way of finding the closed patterns, because we need to find all patterns and then filter out the patterns that are non closed. There are some other algorithms such as ClaSP and CM-Clasp and CloFast that will try to find the closed sequential patterns DIRECTLY. This should be faster in most case.

Hope that this helps.

Hi, Yes you are right… I just look at the code now, and it is true that it is calling ClaSP. But actually, ClaSP is an extension of SPAM. And the code is calling ClaSP with executePruningMethods = false. Thus, I think that it is applying ClaSP without the pruning strategies, which would make it more or less equivalent to applying SPAM since ClaSP is based on SPAM. But yes, I agree that it is misleading. That code was not implemented by me. If you look at the class AlgoClasp.Java, there is a method to remove patterns by post-processing:

frequentPatternEnumeration.removeNonClosedPatterns(outputPatternsFromMainMethod, keepPatterns);

I will check the code in more details beause I also see that findClosed = true when calling ClaSP… Maybe it should be false!

非常感谢您的回答。将静态的SPAM算法改为增量的SPAM算法，这非常有意义。你能修改么？我们可以一起研究将静态的将静态的SPAM算法改为增量的SPAM算法，并用Java代码实现它。

Pingback: Brief report about the MIWAI 2019 conference - The Data Mining Blog

Pingback: A brief report about the Big Data Analytics 2019 conference (BDA 2019) - The Data Mining Blog

Sir, I want to mine for sequences in a dataset of a different order.

Here is my problem https://stackoverflow.com/questions/59980637/python-preprocess-my-dataframe-to-run-sequential-pattern-mining-algorithms-pys

Thanks for reading the blog. I cannot help because I don’t use Python. Maybe the author of that code can help.

I can only provide support for the SPMF library, which I am the founder. It offers a hundred of pattern mining algorithms. Other library, I don’t use.

Hi, very helpful article.

I have a set of sequences (e.g. A -> B -> C -> A) but where each state has a different weighting where high is more important (e.g. A=10 -> B=3 -> C=4 -> A=6).

What is the best algorithm for extracting the most common sequences that allows for the weighting and do you know of any libraries and/or code to perform this analysis. I have previously been using CSPADE but do not think that it possible to introduce the concept of weights.

Many thanks,

Hi， Glad the article is useful.

For adding weights, you may use a high utility sequential pattern mining algorithms like USpan. High utility sequential pattern mining is a generalization of frequent sequential pattern mining where items can have weights and where there can also be quantities in sequences. From your message, I see that you do not need quantities. But still, you could use USpan because you have weights.

There is a fast implementation of USpan in Java as part of the SPMF data mining library, which I am the founder. This is the website: http:\\www.philippe-fournier-viger.com\spmf\ The software has a simple GUI, and can be used also from the command line and takes text files as input and output. It also has many other algorithms for sequential pattern mining with various types of constraints such as gap constraints, etc. And it is open-source.

Best regards,

Philippe

Pingback: (video) Discovering interpretable high utility patterns in databases - The Data Mining Blog

Hi,

I looked through your example above and I think there might be a few extra frequent sequences (have not looked through all comments in detail so I might have missed some details).

However fg, e and a, fg, e and b, fg, e all with support of 2 also seems to be valid frequent sequences.

You agree?

Best regards,

/ Kjell Orsborn

Hi Kjell,

Thanks for your feedback. You are correct. I have added these patterns below the table.

Best regards!

Philippe

Hi,

I have a question. So I am using sequential pattern mining and get a list of frequent sequences. And I want to count how many time it appears for each ID. For example,

lets assume we have

id1, (a), (bc), (f, d, d)

id2, (bc), (e,d)

now bc,d could be one of the frequent pattern sequences. I want to count how many times it appears in each id. For id1, we get 2, and for id2, we get 1. Is there anything we can do except brute force? Thanks!

Hi,

Thanks for reading the blog. The traditional sequential pattern mining algorithms dont count how many times a pattern appear in each sequence.

It could be done as a post-processing step after finding the sequential patterns. For example, in my SPMF software, after applying a sequential pattern mining algorithm, you will get a file containing patterns and sequence ids. Then you can apply the OCCUR algorithm on that file to calculate all the occurrences of each sequential pattern. This would not tell you exactly how many occurrences, but it would find the occurrences… so you could modify that code a little bit to count it.

Another way could be to modify the sequential pattern mining algorithm but this may not be so easy because during the search for patterns, they usually dont keep all the occurrences because they are not designed to count them.

By the way, in you example, a little problem is that in sequential pattern mining, the same item cannot appear more than once in the same itemset. So this is illegal:

(f,d, d) because ‘d’ appears more than once in the same itemset.

It should be : (f, d)

Lastly, another comment is that there are many ways of counting the number of occurrences in a sequence. For example, you can say that (bc)(d) appears once in this sequence: (bc)(d)(d) but you could also say that it appears twice… it depends on whether you want to count the overlapping occurrences or not. There is some advantages or disadvantages for different ways of counting the occurrences… So it is something that you need to think about… what is the definition that you need.

Best regards,

Thanks for the reply, Professor!

I do count the overlapping occurrences. Maybe I should try both. The reason is that I am using n-gram to do a classification task. Because the data is temporal, I am thinking to use sequential pattern mining and then compare with n-gram or add as some extra features and see if I get get a better result. What do you think?

Also, another question is that from my understanding, for sequential pattern mining, items within an element are unordered. but we get still get patterns from within an element.

For example

(a)(bc)(d)

(e)(bc)(f)

bc can be a freq seq.

if we have something like this,

(a)(bc)(d)

(e)(cb)(f)

bc is no longer like that. Does this mean when I construct the sequences, I have to keep their original order?

Hi again,

1) Yes, there is no order inside an itemset. So (bc) should have the same meaning as (cb). However, in the implementation of sequential pattern mining algorithms, we still put some order, but this is just for the purpose of optimizations. For example, in SPMF, I require that items whithin an itemset are ordered by some total order like the alphabetical order. This is just to be able to use some optimization in my code.

So in the input file, there need to be some order whithin an itemset that is ALWAYS the same like the alphabetical order. But this is just for the purpose of optimization. From the purpose of your application, (bc) is the same as (cb).

2) Yes, it will be interesting to compare n-grams with sequential patterns. In the context of text mining, sequential patterns are sometimes called “skip-grams”. I think it will be helpful for you to know this 😉

In some papers, I have actually used n-grams and skipgrams for the classification of texts for authorship attribution using SPMF:

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

LINK: http://www.philippe-fournier-viger.com/FLAIRS2016__AUTHORSHIP_ATTRIBUTION.pdf

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Using Frequent Fixed or Variable-Length POS Ngrams or Skip-grams for Blog Authorship Attribution . Proc. 12th Intern. Conf. on Artificial Intelligence Applications and Innovations (AIAI 2016), Springer LNAI, pp. 63-74

However, in these papers, I did not directly use words. I actually used skip-grams and ngrams of part-of-speech from the sentences. But it is similar. And actually, it is the project of a master degree student who did most of the work.

Best regards,

Philippe

Pingback: Datasets of 30 English novels for pattern mining and text mining - The Data Mining Blog