In this blog post, I will give an introduction to **sequential pattern mining, **an important data mining task with a wide range of applications from text analysis to market basket analysis. This blog post is aimed to be a short introductino. If you want to read a more detailed introduction to **sequential pattern mining**, you can read a survey paper that I recently wrote on this topic.

**What is sequential pattern mining?**

Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. **Pattern mining** consists of discovering interesting, useful, and unexpected patterns in databases Various types of patterns can be discovered in databases such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of **sequential pattern mining** is a data mining task specialized for analyzing **sequential data, **to discover** sequential patterns**. More precisely, it consists of discovering interesting subsequences in **a set of sequences**, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as **sequences of symbols** in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

I will now explain the task of** sequential pattern mining** with an example. Consider the following** sequence database**, representing the purchases made by customers in a retail store.

This database contains four sequences. Each **sequence** represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items *a* and *b* together, then purchased an item *c*, then purchased items *f* and *g *together, then purchased an item *g*, and then finally purchased an item *e. *

**Traditionally, sequential pattern mining **is being used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the **frequent** **sequential patterns**. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers to take marketing decisions.

To do **sequential pattern mining**, a user must provide a sequence database and specify a parameter called the **minimum support threshold**. This parameter indicates a minimum number of sequences in which a pattern must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of** sequential pattern mining** consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, 29 subsequences met this requirement. These sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the *support*) is indicated in the right column of the table.

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appears in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied on database containing hundreds of thousands of sequences.

Another example of application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

**Can sequential pattern mining be applied to time series?**

Besides sequences, **sequential pattern mining** can also be applied to** time series** (e.g. stock data), when discretization is performed as a pre-processing step. For example, the figure below shows a **time series** (an ordered list of numbers) on the left. On the right, a **sequence** (a sequence of symbols) is shown representing the same data, after applying a transformation. Various transformations can be done to transform a time series to a sequence such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.

**Where can I get Sequential pattern mining implementations?**

To try sequential pattern mining with your datasets, you may try the open-source** SPMF data mining software**, which provides implementations of numerous** sequential pattern mining algorithms: **http://www.philippe-fournier-viger.com/spmf/

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering **maximal sequential patterns**, **closed sequential patterns** and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

**What are the current best algorithms for sequential pattern mining?**

There exists several sequential pattern mining algorithms. Some of the classic algorithms for this problem are **PrefixSpan, Spade, SPAM,** and **GSP**. However, in the recent decade, several novel and more efficient algorithms have been proposed such as **CM-SPADE** and **CM-SPAM** (2014), **FCloSM **and** FGenSM** (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

**Conclusion**

In this blog post, I have given a brief overview of **sequential pattern mining**, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms forf sequential pattern mining, extensions, research challenges and opportunities.

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). **A Survey of Sequential Pattern Mining**. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.

—

**Philippe Fournier-Viger** is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

What is the difference between sequential pattern and periodic pattern?

Hello, the basic difference is the following:

Periodic patterns are discovered in a single sequence. We want to find a pattern that periodically appears. For example, it must appear approximately every week.

Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is common to several sequences. If a pattern appears multiple times in the SAME sequence, it will still be counted as just once for sequential pattern mining algorithms.

I am working on an interesting problem now. I’d like to pattern mine stack traces like the following and teach a parser how to process them.

“Error: Something unexpected has occurred.

at main (c:\Users\Me\Documents\MyApp\app.js:9:15)

at Object. (c:\Users\Me\Documents\MyApp\app.js:17:1)

at Module._compile (module.js:460:26)”

The interesting part, that according to your definition the “\nat * (*:*:*)” is both a sequential and a periodic pattern, because it is present in many different sequences and in a single sequence multiple times. Is there a way to distinguish it from the “^*Error: *\n**$” pattern, which is present only once per sequence beside that it is more frequent? Are these periodic pattern somehow special? Any thoughts, suggestions, recommended algorithms?

Hello,

Yes, you can see a stack trace as a sequence perhaps. Then the type of error could be seen as a “class” that annotates the sequence. For example, if you use my SPMF software, you can discover some

class sequential rulesusing the TopSeqClassRules. That means a type of sequential pattern that is specific to some class such as Error:Somethime unexpected happened” and that has the form of a rule such as X –> Y. I think you could check various algorithms and see what makes the most sense for your problems. Best,Okay, thanks!

Got invalid argument exception by your app. Most of the algorithms appear to require numbers. Does any of them work on strings? Or should I just modify the code?

Internally, all the algorithms use a number based representations of items/events because it is much more efficient when searching for patterns than using strings, in terms of space and memory. For example, if you have two strings “abcdefghij” and “abcdefghik”, then you can represent them as strings or as integers. If internally, we represent as strings, then we need memory to store about 10 characters, while an integer is only 32 bits or 64 bits. Besides, comparing two integers is much faster than comparing two strings. If you compare two integers, you just use maybe one instruction of the CPU (>= <>) while comparing two strings, requires to compare each character one by one. This is the reason why these algorithms uses numbers as internal representation.

Having said that, this is also the reason why the default input format for most algorithms in SPMF is based on integers. But you can also use String. A simple way is to convert your input file from strings to integers, apply the algorithm, and then convert the integers to strings in the result. I have actually, implemented something like that in SPMF. If you are using the user interface of SPMF, you can use string as input for some algorithms like CM-SPAM. Here is a description of how to do that from the documentation of SPMF for that algorithm:

If you are not using the graphical interface of SPMF, but want to use strings, you can check example, this example of the documentation:

http://www.philippe-fournier-viger.com/spmf/Using_a_TEXT_file_as_input.php

In my case it does not help much, but I could use ASCII codes to convert a trace to a seq of numbers. Most of the traces contain ASCII characters only I think, so for testing purposes this will be just fine. As far as I can tell the rules of your input format is the following: 1 sequence in 1 line and the items are separated by space. I guess there is no way to use itemsets instead of items, but in my case itemsets are not needed.

Yes, the format of the input file for the algorithms is explained in the documentation on the webpage for each algorithm. For example, to mine sequential patterns using a normal algorithms like CM-SPAM or sequential rules with RuleGrowth, the format is:

There is an online ascii converter, so there is no need to write a single line of code:

https://www.browserling.com/tools/text-to-ascii

As far as I understand these pattern mining algorithms find subsequences or substrings. I’d like to convert my patterns to regex patterns, so parsing would be a lot easier. For example if I have the following sequences: “abbc”, “abxc”,”aybc”,”abzc”, then the “abc” will be the common subsequence with 4/4 support. But if I want to parse the sequences with “abc”, then I will get the following results: “ab(b)c”,”a(b)bc”,”ab(x)c”,”a(y)bc”,”ab(z)c”. As you can see the content of the wildcard can be either before or after the “b” item. So if I want to translate this to regex I would have 2 patterns: /ab(.*)c/”, “a(.*)bc” or we can merge them into one: /a(.*)b(.*)c/. If I expect a single capturing group always on the same place, then I have to choose between the first two regex patterns. Is there any algorithm which interprets patterns like this (so checks where the gap position is by matching), or do I have to invent some kind of converter between the subsequence patterns and the regex patterns? I could write a parser based on subsequences too, but I assume the C based regex of the actual js engine is a lot faster than writing something in pure javascript.

I started reading your survey paper ( A Survey of Sequential Pattern Mining). I had a query. You stated that an item set X is a subset of items I.

Thus an item, say A, can appear at most ‘k’ number of times in an itemset if it appears ‘k’ number of items in set I.

However, while explaining BFS approach you took I = {a,b,c} and while enumerating 2-sequences you generated as well.

I guess there is an error in this enumeration.

Also while explaining definition of support of subsequence there is a mistake in its formulated definition.

Given is:

sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |

I guess it should be:

sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |

Please clarify this query.

Thanks.

Hello, Thanks for your feedback.

Yes, there was an error in the definition of sup(sa), as you noticed. I have fixed it and re-uploaded the paper to my website:

http://www.philippe-fournier-viger.com/dspr-paper5.pdf

Thanks for reporting it.

For the other question, I think that something is missing in your question. Do you want to ask why the same item can appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple in a same sequence. Let me explain this more clearly. In sequential pattern mining, an item can only appear at most one time in each itemset. But a sequence can contain multiple itemsets. Thus, it is possible that the same item appear multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For example, the sequence (a)(a) is correct because it contains two times “a” but not in the same itemset. But the sequence (a a) would be incorrect according to the definition because it contains many times the same item in the same itemset. So, to explain again, an itemset X in a sequence must be a subset of I. Thus, an item cannot appear more than once in an itemset. However, a sequence contains many itemsets. Thus, the same item is allowed to appear multiple time in the same sequence, but it must be in different itemsets.

Hope this helps! And thanks for reporting the error.

Best regards,

Yes. That was my doubt. X should be a subset of I.

If I = {a,a,b,c} then itemset {a,a} is valid.

However if I = {a,b,c} then {a,a} is invalid.

I got confused because while explaining BFS based algorithms on Page 58 there is an example that takes I = {a,b,c}. And while enumerating 2-sequences contains itemsets like {a,a}, {b,b}, {c,c}.

I wanted to point out the above fact. As I = {a,b,c}. The sequence is valid. However the sequence is not.

EDIT in the above reply:

I wanted to point out the above fact. As I = {a,b,c}. The sequence ({a}, {a}) is valid. However the sequence ({a,a}) is not.

Hello,

I understand now. Yes, you are right, there is some error in that example. I have fixed and re-uploaded the paper to my website. Thanks!

Philippe

Hello,

I have a question about the example in this article. Is correct that the pattern has a support of 2? Because for what I understood this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there some errors in the example or I am wrong in the comprehension of Sequential Pattern?

Thanks.

Hello, I would like to answer your question but I am not sure about which patterns you are talking? I think that when you submitted your question, WordPress might have removed the patterns from your comment because of the > symbols.

If you can tell me which pattern, I will answer your question.

Yes, it probably thinks that they are a tags. My bad. The patters are ({b},{f,g}) and also ({f,g}) alone.

Thanks.

Hi there is nothing wrong with that. Here is the point, you should look for if the dataset (the first sequence) involves the subsequence respectively. To be more clear, b is a subsequence of (a,b) in the sequence so it satisfies the rule then look for the (f,g) I believe your trouble is with b so I will not talk about the other items.

Yes, now it is correct. At the time there was an error. The two sequential patterns that I mentioned had a support of 2 in the old image, as reported in a comment below.

Hello,

I’m curios about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. Because I’m studying on this research area in order to adopt this concept to find sequential high yield pattern from manufacturing processes. Thus, I’m trying to search several articles including ones written by yours.

While searching, I’m confused about the difference of the above two terms.

In my opinion, the main difference is ‘sequential’. Since the result of some algorithms related to ‘high utility itemset’ do not consider the order of transactions, right?

If I were right, I will able to narrow down my searching scope not ‘high utility itemset’ but ‘high utility sequential pattern’.

Thank you, Sir.

Hi,

Yes, that is correct. In

high utility itemset mining, the time or sequential order is not considered.In high utility itemset mining, the input is a transaction database ( a set of transactions performed by many customers).

The goal is then to find the group of items that yield a high profit when purchased together. But there is no order.

For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple and bread makes a lot of money. But it does not tell anything about the order.

In high utility

sequential pattern mining, the order is considered.Here the input is a sequence database. It means a set of sequences of transactions.

For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions that is sequentiall ordered.

Then you can find some patterns such as < (apple),(bread)> which means that buying apple followed by buying bread yield a high profit.

But a problem of high utility sequential pattern mining is that there is no concept of confidence.

If you have the pattern < (apple),(bread)> , you still don’t know likely it is that if someone buy Apple, he will then buy bread.

To solve this problem,

high utility sequential rule miningwas proposed.It is also applied on a sequence database, and will find rules such as:

Apple –> Bread confidence : 70 % which means that this rules yield a lot of money and when someone buy Apple, 70 % of the time he will buy bread after.

Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode mining is similar to sequential pattern mining except that patterns are mine from a single sequence instead of many sequences.

Thanks for your reply.

This comment is quite helpful to understand those concepts.

I’ve already read an article considering “USpan Algorithm”.

I think high utility sequential pattern mining is appropriate to my research problem.

Can you recommend some article about high utility sequential pattern mining except for “USpan Algorithm”?

I’m trying to find a kind of ‘Top K’ mining algorithm for high utility sequential pattern mining, but that’s not easy.

Btw, Thank you again for your kind reply.

Hello, You are welcome.

Yes, USpan is a good algorithm. The authors of that algorithm wrote a paper about a top-k algorithm for high utility sequential pattern mining, published at ICDM 2013:

https://www-staff.it.uts.edu.au/~lbcao/publication/icdm13-yin.pdf

The code of USpan is available in the SPMF data mining library as open-sourcehttp://www.philippe-fournier-viger.com/spmf/.

If you want some faster high utility sequentila pattern mining algorithms (faster than USpan), you can check CRoM and HuspExt.

Best,

Philippe

Hi,

I was wondering if you could explain briefly when a sequential pattern mining method (algorithm) is robust. How to validate the robustness of an algorithm?

Best regards,

Cristina

Hello,

In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input. How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern mining algorithm to work well with different kind of data, and to show some errors if the input data or parameter value(s) are not what the algorithm should expect. So this is software engineering 101. When you design a program, you should test it with different kind of data and parameters to make sure that it works in all situations, and make sure that it shows error message to the user when an error message should be shown.

Regards,

Thank you

what is the difference between frequent pattern mining and sequential pattern mining?

In simple words, in frequent itemset mining, we want to find some patterns (sets of items) that appear in many transactions. But we don’t care about the time or order between these items. For example, we could find that many customers buy bread with apple. But there is no order between bread and apple. We just know that many people buy it together

In sequential pattern mining, we also consider the order or time. Instead of finding patterns in transactions, we find patterns in some sequences. Each sequence is an ordered list of transactions. For example, we can have some sequence such that today i buy bread, tomorrow i buy milk, and after tomorrow i buy bread and milk. Thus, we have many sequences like that and we want to find the subsequences that appear in many sequences. For example, we could find the sequential pattern that many people buy milk and then after that buy bread, and then later buy coffee. Here there is some order.

In the case where all sequences only contain one transaction, then sequential pattern mining is the same as frequent itemset mining. Thus sequential pattern mining is more general.

I tried to give you the main idea in a simple way. If you want some more formal definition you could look at my survey papers on these topics which are easy to understand, and read the first few pages:

Fournier-Viger, P., Lin, J. C.-W., Vo, B, Chi, T.T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, e1207 doi: 10.1002/widm.1207, 18 pages. ISI, EI, SCI, Q1

ISSN: 1942-4787

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.

Hi Philippe,

Thank you very much for this useful introduction. I have the following questions:

1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset should be a sequential pattern since it appears in SID1 and SID2.

2. I think the support of should be 3 since it appears in SID1, SID3, and SID4 and the support of should be 3 since it appears in SID1, SID3, and SID4.

I am now reading your survey paper on frequent sequential pattern mining to gain more knowledge on this field.

Happy New Year.

Dang Nguyen

Since some errors happened, I re-write my questions:

1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset ({b, a}) should be a sequential pattern since it appears in SID1 and SID2.

2. I think the support of ({b}, {f, g}) should be 3 since it appears in SID1, SID3, and SID4 and the support of ({f, g}) should be 3 since it appears in SID1, SID3, and SID4.

Hi Dang,

Thanks for your comments and happy new year.

1. Yes, ({a,b}) has a support of 2 in the above table. As you said, there is no order in the itemsets. Thus ({a,b}) is the same as ({b,a)}.

2. Yes, that is an error in the example! I have updated the example so that ({b}, {f, g}) and ( {f, g}) have a support of 3. I will update the survey paper, which contains the same error probably tonight. Thanks for reporting this issue.

Best,

Philippe

Is it possible to use contiguous pattern mining for mining breast cancer data. I’m trying to mine the particular gene BRCA1 or BRCA2 to find the interesting patterns in it. Research direction is correct or not? Kindly give your valuable suggestios

Hi,

Thanks for reading. You could try to use that. But is it a good idea? Since I know nothing about cancer data and genes, it cannot give advice about that.

Best,

Thank u for your reply.

Mining normal and cancer data using various parameters and comparing them. This approach works?

It depends on what is your goal and what is the result. You can try it. If you can find some interesting patterns that are useful to understand cancer, then it will be interesting. Ideally you could find some patterns that you would validate with a domain expert (e.g. doctor) to show that these patterns are interesting.

Otherwise, you can work on a more technical level and work on improving the algorithms to add more features that are needed for this type of data.

Thanks a lot

First of all thanks for this amazing article.

I recently learned about SPMF and wanted to use it on my data, unfortunately I didn’t succeed to apply it properly.

My data is a time series and look like this (DATE TIME , Name):

2009-03-12 14:15:17 , James

2009-03-12 14:15:19 , James

2009-03-12 14:15:07 , James

2009-03-12 14:17:11 , Mike

2009-03-12 14:25:15 , Julie

2009-03-13 14:32:39 , Thomas

2009-03-13 14:48:46 , Mike

2009-03-13 14:51:23 , Madona

2009-03-13 14:59:47 , Julie

And so on.

I am searching for any possible pattern in this data.

More precisely I want to know “When Julie sends a message?”.

And if there is any relation between Julie’s and others’ behavior.

There are also the assumptions below:

1. I do not know if there is any relation or not.

2. Always there might be new people joining the chat room.

3. There are several “usual guests” who are always taking part in conversations and are very talkative.

4. We do not know anything about these people (and their behavior). They are just a “name” for us.

Now my question is: Is there any possibility to use SPMF to extract patterns from this time series, which could potentially be used to predict Julie’s behavior according to the others’ behavior?

Thank you again for this wonderful blog.

Hi, Glad you like the blog post and the blog 😉

Yes, this is a time-series but if you ignore the timestamps, you can also see it as a sequence and apply sequential pattern mining algorithm. The key is that if you can convert your data to the same format as some existing problem, then you can use the existing algorithms.

In your case, you could view each day of the chat room as a sequence: a sequence for 2018-01-01, a sequence for 2018-01-02, a sequence for 2018-01-03. Then each sequence would be like (mike_join, mike_post_message, julie_join, julie_post, mike_quit, julie_quit…) Thus, you would have a sequence database with a sequence for each day, and then you could apply sequential pattern mining or sequential rule mining to find patterns common to severa days… Or maybe that a day is too long. Actually, if you use days, then the sequences will be very long and the algorithm may not work well, so perhaps that dividing by hours may work better.

But I am not sure that pattern mining is the best technique for this problem.

But if you convert your data to sequences, you could also try to the sequence prediction algorithm that can be trained with sequence to predice what will happen next for a new sequence.

This is just some idea. Actually, your problem may be a bit more complex, especially if you also want to consider timestamps!

Best,

Philippe

Thanks for the detailed reply.

I will give the “sequential pattern mining” a try and see if it provides me useful results. However, I still need to know learn what to do with the noises. For example when the internet connection of some users once in a while disconnects and users logout/login randomly.

For the learning purpose, I still would like to know if there is any possibility to detect temporal patterns in this type of time-series.

For example, can we somehow come up with the below (imaginary) model:

IF (James AND [+2 minutes] Mike) THEN ([+8 minutes] Julie).

IF (Thomas AND [+16 minutes] Mike AND [+3 minutes] Madona) THEN ([+8 minutes] Julie).

Is there any technique which could be useful in such scenarios? Also in presence of noises?

Hello,

Glad my reply is useful. In my opinion, if you consider the time, it becomes much more complicated because you then need to decide whether (James AND [+60 seconds] Mike) is the same as (James AND [+61 seconds] Mike) or (James AND [+65 seconds] Mike). If you remove the time, then they become the same pattern. But if you consider time… then either all of these patterns are considered as different (but it is probably not what you want…) or you need to find a way to define what is similar such as 60 with more or less 5 minutes is considered as similar… Most paper that I have read in sequential pattern mining cannot do that. But it does not mean that it cannot be done. You could always design a new algorithm and you add some news of handling the time with more flexibility and write a paper about it.

An example of paper that consider the time in sequential pattern mining is:

http://www.philippe-fournier-viger.com/spmf/hirateyamana.pdf

However, for that paper, A +10 seconds B will be seen as different from A +11 seconds B for example…

Best,

Thank you very very much for your answers.

Now I have a better picture in mind and will follow your suggestions.

Bests 🙂