In this blog post, I will give an introduction to **sequential pattern mining**, an important data mining task with a wide range of applications, from text analysis to market basket analysis. This post is meant to be a short introduction. If you want to read a more detailed introduction to **sequential pattern mining**, you can read a survey paper that I recently wrote on this topic.

**What is sequential pattern mining?**

Data mining consists of extracting information from data stored in databases to understand the data and/or make decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. **Pattern mining** consists of discovering interesting, useful, and unexpected patterns in databases. Various types of patterns can be discovered in databases, such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

**Sequential pattern mining** is a data mining task specialized for analyzing **sequential data** to discover **sequential patterns**. More precisely, it consists of discovering interesting subsequences in **a set of sequences**, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications because data is naturally encoded as **sequences of symbols** in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

I will now explain the task of **sequential pattern mining** with an example. Consider the following **sequence database**, representing the purchases made by customers in a retail store.

This database contains four sequences. Each **sequence** represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought items *a* and *b* together, then purchased an item *c*, then purchased items *f* and *g* together, then purchased an item *g*, and finally purchased an item *e*.
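One possible way to hold such a sequence in code, as a minimal sketch: each itemset becomes a `frozenset` and the sequence is a plain Python list, so the order between itemsets is kept while the order inside an itemset is ignored. The names `sid1` and `format_sequence` are illustrative assumptions, not part of any library.

```python
# First sequence of the example (SID 1), as described in the text:
# {a, b} -> {c} -> {f, g} -> {g} -> {e}
sid1 = [
    frozenset({"a", "b"}),
    frozenset({"c"}),
    frozenset({"f", "g"}),
    frozenset({"g"}),
    frozenset({"e"}),
]


def format_sequence(sequence):
    """Render a sequence in the <{a, b}, {c}> notation used in this post."""
    itemsets = ("{" + ", ".join(sorted(itemset)) + "}" for itemset in sequence)
    return "<" + ", ".join(itemsets) + ">"


print(format_sequence(sid1))
```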

Traditionally, **sequential pattern mining** is used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the **frequent sequential patterns**. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers and to make marketing decisions.

To do **sequential pattern mining**, a user must provide a sequence database and specify a parameter called the **minimum support threshold**. This parameter indicates the minimum number of sequences in which a pattern must appear to be considered frequent and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of **sequential pattern mining** consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, 29 subsequences meet this requirement. These sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the *support*) is indicated in the right column of the table.

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appear in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied to databases containing hundreds of thousands of sequences.
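The support computation described above can be sketched in a few lines of Python. The database table is not reproduced in this text, so the four sequences below are an assumption: SID 1 follows the description given earlier, and SIDs 2 to 4 were chosen to agree with the supports quoted for <{a}> and <{a}, {g}>. The helper names (`S`, `contains`, `support`) are hypothetical, and this is a naive check, not how efficient algorithms count support.

```python
def S(*itemsets):
    """Build a sequence from strings, e.g. S("ab", "c") -> <{a, b}, {c}>."""
    return [frozenset(itemset) for itemset in itemsets]


# ASSUMED example database (SID -> sequence); see the note above.
database = {
    1: S("ab", "c", "fg", "g", "e"),
    2: S("ad", "c", "b", "abef"),
    3: S("a", "b", "fg", "e"),
    4: S("b", "fg"),
}


def contains(sequence, pattern):
    """True if each itemset of `pattern` is a subset of a distinct itemset
    of `sequence`, with the order of itemsets preserved."""
    position = 0
    for itemset in pattern:
        # Greedily look for the earliest itemset that can host this one.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(database, pattern):
    """Number of sequences that contain the pattern (each counted once)."""
    return sum(1 for sequence in database.values() if contains(sequence, pattern))


min_support = 2
for pattern in (S("a"), S("a", "g"), S("d", "g")):
    count = support(database, pattern)
    print(pattern, count, "frequent" if count >= min_support else "infrequent")
```

A real algorithm avoids testing every candidate against every sequence; this version only makes the definition of support concrete.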

Another application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as a sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.
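For the contiguous case, a small sketch of counting n-grams by the number of sentences in which they occur could look like this. The three sentences are made up for illustration, and `frequent_ngrams` is a hypothetical helper name; tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter


def frequent_ngrams(sentences, n, min_support):
    """Return the n-grams that occur in at least `min_support` sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # A set is used so that a sentence contributes at most 1 per n-gram.
        ngrams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(ngrams)
    return {ngram: c for ngram, c in counts.items() if c >= min_support}


# Made-up sentences, each one playing the role of a sequence of words.
sentences = [
    "The dog barked at night",
    "the dog slept",
    "A cat saw the dog",
]
print(frequent_ngrams(sentences, n=2, min_support=2))
```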

**Can sequential pattern mining be applied to time series?**

Besides sequences, **sequential pattern mining** can also be applied to **time series** (e.g. stock data), when discretization is performed as a pre-processing step. For example, the figure below shows a **time series** (an ordered list of numbers) on the left. On the right, a **sequence** (a sequence of symbols) is shown representing the same data, after applying a transformation. Various transformations can be used to convert a time series to a sequence, such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.
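To give an idea of what such a discretization does, here is a simplified SAX-style sketch: the series is z-normalized, averaged over equal-width segments, and each segment mean is mapped to a letter using breakpoints that split a standard normal distribution into equiprobable regions. It assumes the series length is divisible by the number of segments and only supports alphabets of size 3 or 4; the function name `sax` and these restrictions are choices made for this example, not a reference implementation.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Breakpoints cutting a standard normal distribution into equiprobable
# regions, rounded to two decimals, for alphabet sizes 3 and 4.
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}


def sax(series, n_segments, alphabet_size=4):
    """Convert a list of numbers into a string of symbols (simplified SAX)."""
    if len(series) % n_segments != 0:
        raise ValueError("this sketch needs len(series) divisible by n_segments")
    mu, sigma = mean(series), pstdev(series)
    # z-normalization; a constant series is mapped to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    width = len(z) // n_segments
    cuts = BREAKPOINTS[alphabet_size]
    symbols = []
    for start in range(0, len(z), width):
        segment_mean = mean(z[start:start + width])
        # The number of breakpoints at or below the mean selects the letter.
        symbols.append("abcd"[bisect_right(cuts, segment_mean)])
    return "".join(symbols)


print(sax([1, 1, 2, 2, 9, 9, 10, 10], n_segments=4))
```

Each resulting symbol can then be treated as an item, so that a set of discretized time series forms a sequence database.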

**Where can I get sequential pattern mining implementations?**

To try sequential pattern mining with your datasets, you may try the open-source **SPMF data mining software**, which provides implementations of numerous **sequential pattern mining algorithms**: http://www.philippe-fournier-viger.com/spmf/

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering **maximal sequential patterns**, **closed sequential patterns** and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

**What are the current best algorithms for sequential pattern mining?**

There exist several sequential pattern mining algorithms. Some of the classic algorithms for this problem are **PrefixSpan**, **SPADE**, **SPAM**, and **GSP**. However, in the last decade, several novel and more efficient algorithms have been proposed, such as **CM-SPADE** and **CM-SPAM** (2014) and **FCloSM** and **FGenSM** (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining, such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

**Conclusion**

In this blog post, I have given a brief overview of **sequential pattern mining**, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms for sequential pattern mining, extensions, research challenges and opportunities.

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). **A Survey of Sequential Pattern Mining**. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.

—

**Philippe Fournier-Viger** is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

What is the difference between sequential pattern and periodic pattern?

Hello, the basic difference is the following:

Periodic patterns are discovered in a single sequence. We want to find a pattern that periodically appears. For example, it must appear approximately every week.

Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is common to several sequences. If a pattern appears multiple times in the SAME sequence, it is still counted only once by sequential pattern mining algorithms.
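The two ways of counting can be contrasted with a deliberately simplified sketch, where sequences are strings of single items rather than lists of itemsets. The function names `support` and `periods` are assumptions made for this example, and real periodic pattern mining algorithms use richer measures than the raw gaps computed here.

```python
def support(database, item):
    """Sequential view: number of sequences containing the item.
    Repeated occurrences inside one sequence still count only once."""
    return sum(1 for sequence in database if item in sequence)


def periods(sequence, item):
    """Periodic view: gaps between consecutive occurrences of the item
    inside a single sequence."""
    positions = [i for i, x in enumerate(sequence) if x == item]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Several sequences: "a" occurs twice in the first one but is counted once.
print(support(["abab", "ca", "bc"], "a"))

# A single long sequence: "a" reappears every 3 positions.
print(periods("abcabcab", "a"))
```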

I started reading your survey paper ( A Survey of Sequential Pattern Mining). I had a query. You stated that an item set X is a subset of items I.

Thus an item, say A, can appear at most ‘k’ times in an itemset if it appears ‘k’ times in set I.

However, while explaining BFS approach you took I = {a,b,c} and while enumerating 2-sequences you generated as well.

I guess there is an error in this enumeration.

Also while explaining definition of support of subsequence there is a mistake in its formulated definition.

Given is:

sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |

I guess it should be:

sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |

Please clarify this query.

Thanks.

Hello, Thanks for your feedback.

Yes, there was an error in the definition of sup(sa), as you noticed. I have fixed it and re-uploaded the paper to my website:

http://www.philippe-fournier-viger.com/dspr-paper5.pdf

Thanks for reporting it.

For the other question, I think that something is missing in your question. Do you want to ask why the same item can appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple times in the same sequence. Let me explain this more clearly. In sequential pattern mining, an item can appear at most once in each itemset. But a sequence can contain multiple itemsets. Thus, the same item can appear multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For example, the sequence (a)(a) is correct because it contains “a” twice, but not in the same itemset. But the sequence (a a) would be incorrect according to the definition because it contains the same item more than once in the same itemset. To summarize, an itemset X in a sequence must be a subset of I, so an item cannot appear more than once in an itemset. However, a sequence contains several itemsets, so the same item is allowed to appear multiple times in the same sequence, as long as it is in different itemsets.
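This rule can be written as a small check, sketched here under the assumption that each itemset is given as a list so that a duplicated item can be represented at all. The function name `is_valid_sequence` is hypothetical.

```python
def is_valid_sequence(sequence, items):
    """Check that every itemset is a duplicate-free subset of `items`."""
    for itemset in sequence:
        if len(set(itemset)) != len(itemset):
            return False  # the same item appears twice in one itemset
        if not set(itemset) <= items:
            return False  # the itemset uses an item that is not in I
    return True


I = {"a", "b", "c"}
print(is_valid_sequence([["a"], ["a"]], I))  # (a)(a): same item, two itemsets
print(is_valid_sequence([["a", "a"]], I))    # (a a): same item, one itemset
```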

Hope this helps! And thanks for reporting the error.

Best regards,

Yes. That was my doubt. X should be a subset of I.

If I = {a,a,b,c} then itemset {a,a} is valid.

However if I = {a,b,c} then {a,a} is invalid.

I got confused because while explaining BFS based algorithms on Page 58 there is an example that takes I = {a,b,c}. And while enumerating 2-sequences contains itemsets like {a,a}, {b,b}, {c,c}.

I wanted to point out the above fact. As I = {a,b,c}. The sequence is valid. However the sequence is not.

EDIT in the above reply:

I wanted to point out the above fact. As I = {a,b,c}. The sequence ({a}, {a}) is valid. However the sequence ({a,a}) is not.

Hello,

I understand now. Yes, you are right, there is an error in that example. I have fixed it and re-uploaded the paper to my website. Thanks!

Philippe

Hello,

I have a question about the example in this article. Is it correct that the pattern has a support of 2? From what I understood, this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there errors in the example, or have I misunderstood sequential patterns?

Thanks.

Hello, I would like to answer your question but I am not sure which patterns you are talking about. I think that when you submitted your question, WordPress might have removed the patterns from your comment because of the > symbols.

If you can tell me which pattern, I will answer your question.

Yes, it probably thinks that they are tags. My bad. The patterns are ({b},{f,g}) and also ({f,g}) alone.

Thanks.

Hello,

I’m curious about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. I’m studying this research area in order to adapt this concept to find sequential high-yield patterns from manufacturing processes. Thus, I’m searching through several articles, including ones written by you.

While searching, I’m confused about the difference of the above two terms.

In my opinion, the main difference is ‘sequential’, since the results of algorithms related to ‘high utility itemset’ do not consider the order of transactions, right?

If I am right, I will be able to narrow my search to ‘high utility sequential pattern’ rather than ‘high utility itemset’.

Thank you, Sir.

Hi,

Yes, that is correct. In **high utility itemset mining**, the time or sequential order is not considered. In high utility itemset mining, the input is a transaction database (a set of transactions performed by many customers).

The goal is then to find the group of items that yield a high profit when purchased together. But there is no order.

For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple and bread make a lot of money. But it does not tell anything about the order.
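As a rough illustration, the utility of an itemset is usually computed as the profit it generates in the transactions that contain all of its items. The sketch below assumes that definition, with made-up quantities and unit profits; `itemset_utility` is a hypothetical helper name.

```python
# Made-up data: profit per unit, and purchased quantities per transaction.
unit_profit = {"apple": 1, "bread": 2, "milk": 3}
transactions = [
    {"apple": 2, "bread": 1},
    {"apple": 1, "milk": 1},
    {"apple": 1, "bread": 3, "milk": 1},
]


def itemset_utility(transactions, unit_profit, itemset):
    """Sum quantity x unit profit of the itemset's items over every
    transaction that contains the whole itemset. Order plays no role."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


print(itemset_utility(transactions, unit_profit, {"apple", "bread"}))
```

An itemset would then be reported as high utility if this value is no less than a user-defined minimum utility threshold.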

In **high utility sequential pattern mining**, the order is considered. Here the input is a sequence database, that is, a set of sequences of transactions.

For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions that is sequentially ordered.

Then you can find some patterns such as < (apple),(bread)>, which means that buying apple followed by buying bread yields a high profit.

But a problem of high utility sequential pattern mining is that there is no concept of confidence.

If you have the pattern < (apple),(bread)>, you still don’t know how likely it is that someone who buys apple will then buy bread.

To solve this problem, **high utility sequential rule mining** was proposed. It is also applied to a sequence database, and finds rules such as:

Apple –> Bread (confidence: 70 %), which means that this rule yields a lot of money and that when someone buys Apple, 70 % of the time they will buy bread afterwards.
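The confidence part of such a rule can be sketched as follows, assuming the usual definition: the number of sequences where the antecedent is followed by the consequent, divided by the number of sequences containing the antecedent. Sequences are simplified to lists of single items, the five customer sequences are made up, and the function names are hypothetical. Utility is left out of this sketch.

```python
def follows(sequence, antecedent, consequent):
    """True if `consequent` occurs somewhere after `antecedent`."""
    if antecedent not in sequence:
        return False
    first = sequence.index(antecedent)
    # If the consequent follows any occurrence, it follows the first one.
    return consequent in sequence[first + 1:]


def confidence(database, antecedent, consequent):
    """Fraction of the sequences containing `antecedent` in which
    `consequent` appears afterwards."""
    containing = [s for s in database if antecedent in s]
    if not containing:
        return 0.0
    matching = sum(1 for s in containing if follows(s, antecedent, consequent))
    return matching / len(containing)


# Made-up purchase sequences, one per customer.
database = [
    ["apple", "bread"],
    ["apple", "milk", "bread"],
    ["bread", "apple"],
    ["apple", "milk"],
    ["milk"],
]
print(confidence(database, "apple", "bread"))
```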

Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode mining is similar to sequential pattern mining except that patterns are mined from a single sequence instead of many sequences.

Thanks for your reply.

This comment is quite helpful to understand those concepts.

I’ve already read an article considering “USpan Algorithm”.

I think high utility sequential pattern mining is appropriate to my research problem.

Can you recommend some article about high utility sequential pattern mining except for “USpan Algorithm”?

I’m trying to find a kind of ‘Top K’ mining algorithm for high utility sequential pattern mining, but that’s not easy.

Btw, Thank you again for your kind reply.

Hello, You are welcome.

Yes, USpan is a good algorithm. The authors of that algorithm wrote a paper about a top-k algorithm for high utility sequential pattern mining, published at ICDM 2013:

https://www-staff.it.uts.edu.au/~lbcao/publication/icdm13-yin.pdf

The code of USpan is available as open source in the SPMF data mining library: http://www.philippe-fournier-viger.com/spmf/

If you want some faster high utility sequential pattern mining algorithms (faster than USpan), you can check CRoM and HuspExt.

Best,

Philippe

Hi,

I was wondering if you could explain briefly when a sequential pattern mining method (algorithm) is robust. How to validate the robustness of an algorithm?

Best regards,

Cristina

Hello,

In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input. How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern mining algorithm to work well with different kinds of data, and to report errors if the input data or parameter values are not what the algorithm expects. So this is software engineering 101. When you design a program, you should test it with different kinds of data and parameters to make sure that it works in all situations, and make sure that it shows an error message to the user whenever one should be shown.
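As one illustration of this kind of defensive checking, a mining function could validate its input before doing any work. The sketch below assumes the database is a list of sequences, each a list of itemsets, and that the minimum support is given as a number of sequences; `validate_input` and its error messages are hypothetical and not taken from any existing tool.

```python
def validate_input(database, min_support):
    """Raise ValueError with a clear message if the input is unusable."""
    if isinstance(min_support, bool) or not isinstance(min_support, int):
        raise ValueError("min_support must be an integer number of sequences")
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    if not database:
        raise ValueError("the sequence database is empty")
    if min_support > len(database):
        raise ValueError("min_support exceeds the number of sequences")
    for sid, sequence in enumerate(database, start=1):
        if not sequence:
            raise ValueError(f"sequence {sid} is empty")
        for itemset in sequence:
            if not itemset:
                raise ValueError(f"sequence {sid} contains an empty itemset")


# A well-formed input passes silently; a bad one fails with a message.
validate_input([[{"a"}, {"b"}], [{"a"}]], min_support=2)
try:
    validate_input([[{"a"}, set()]], min_support=1)
except ValueError as error:
    print("rejected:", error)
```

Testing robustness then amounts to feeding such a function (and the algorithm behind it) a range of valid and invalid inputs and checking that each one is either processed correctly or rejected with a clear message.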

Regards,

Thank you