An Introduction to Sequential Pattern Mining

In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with a wide range of applications from text analysis to market basket analysis.  This blog post is aimed to be a short introductino. If you want to read a more detailed introduction to sequential pattern mining, you can read a survey paper  that I recently wrote on this topic.

What is sequential pattern mining?

Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in databases  Various types of patterns can be discovered in databases such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of sequential pattern mining is a data mining task specialized for analyzing sequential data, to discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and  webpage click-stream analysis.

I will now explain the task of sequential pattern mining with an example. Consider the following sequence database, representing the purchases made by customers in a retail store.

sequence database

This database contains four sequences.  Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.  

Traditionally, sequential pattern mining is being used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the frequent sequential patterns. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers to take marketing decisions.

To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the minimum support threshold. This parameter indicates a minimum number of sequences in which a pattern must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences appearing in at least 2 sequences of the input database.  In the example database, 29  subsequences met this requirement. These sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the support) is indicated in the right column of the table.

sequential patterns example

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appears in 3 and 2 sequences of the input database, respectively.  The pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3.   These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied on database containing hundreds of thousands of sequences.

Another example of application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

Can sequential pattern mining be applied to time series?

Besides sequences, sequential pattern mining can also be applied to time series (e.g. stock data), when discretization is performed as a pre-processing step.  For example, the figure below shows a time series  (an ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the same data, after applying a transformation.   Various transformations can be done to transform a time series to a sequence such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.

sequences and time series

Where can I get Sequential pattern mining implementations?

To try sequential pattern mining with your datasets, you may try the open-source SPMF data mining software, which provides implementations of numerous sequential pattern mining algorithms: http://www.philippe-fournier-viger.com/spmf/

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering maximal sequential patterns, closed sequential patterns and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

What are the current best algorithms for sequential pattern mining?

There exists several sequential pattern mining algorithms. Some of the classic algorithms for this problem are PrefixSpan, Spade, SPAM, and GSP. However, in the recent decade, several novel  and more efficient algorithms have been proposed such as CM-SPADE  and CM-SPAM (2014), FCloSM and FGenSM (2017), to name a few.  Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

Conclusion

In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for analyzing sequential data.  If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms forf sequential pattern mining, extensions,  research challenges and opportunities.

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

(Visited 4,854 times, 24 visits today)

Comments

An Introduction to Sequential Pattern Mining — 59 Comments

    • Hello, the basic difference is the following:

      Periodic patterns are discovered in a single sequence. We want to find a pattern that periodically appears. For example, it must appear approximately every week.

      Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is common to several sequences. If a pattern appears multiple times in the SAME sequence, it will still be counted as just once for sequential pattern mining algorithms.

      • I am working on an interesting problem now. I’d like to pattern mine stack traces like the following and teach a parser how to process them.
        “Error: Something unexpected has occurred.
        at main (c:\Users\Me\Documents\MyApp\app.js:9:15)
        at Object. (c:\Users\Me\Documents\MyApp\app.js:17:1)
        at Module._compile (module.js:460:26)”
        The interesting part, that according to your definition the “\nat * (*:*:*)” is both a sequential and a periodic pattern, because it is present in many different sequences and in a single sequence multiple times. Is there a way to distinguish it from the “^*Error: *\n**$” pattern, which is present only once per sequence beside that it is more frequent? Are these periodic pattern somehow special? Any thoughts, suggestions, recommended algorithms?

        • Hello,
          Yes, you can see a stack trace as a sequence perhaps. Then the type of error could be seen as a “class” that annotates the sequence. For example, if you use my SPMF software, you can discover some class sequential rules using the TopSeqClassRules. That means a type of sequential pattern that is specific to some class such as Error:Somethime unexpected happened” and that has the form of a rule such as X –> Y. I think you could check various algorithms and see what makes the most sense for your problems. Best,

          • Got invalid argument exception by your app. Most of the algorithms appear to require numbers. Does any of them work on strings? Or should I just modify the code?

          • Internally, all the algorithms use a number based representations of items/events because it is much more efficient when searching for patterns than using strings, in terms of space and memory. For example, if you have two strings “abcdefghij” and “abcdefghik”, then you can represent them as strings or as integers. If internally, we represent as strings, then we need memory to store about 10 characters, while an integer is only 32 bits or 64 bits. Besides, comparing two integers is much faster than comparing two strings. If you compare two integers, you just use maybe one instruction of the CPU (>= <>) while comparing two strings, requires to compare each character one by one. This is the reason why these algorithms uses numbers as internal representation.

            Having said that, this is also the reason why the default input format for most algorithms in SPMF is based on integers. But you can also use String. A simple way is to convert your input file from strings to integers, apply the algorithm, and then convert the integers to strings in the result. I have actually, implemented something like that in SPMF. If you are using the user interface of SPMF, you can use string as input for some algorithms like CM-SPAM. Here is a description of how to do that from the documentation of SPMF for that algorithm:

            Optional feature: giving names to items

            Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database “contextPrefixSpan.txt”. Here we have modified the file to give names to the items:

            @CONVERTED_FROM_TEXT
            @ITEM=1=apple
            @ITEM=2=orange
            @ITEM=3=tomato
            @ITEM=4=milk
            @ITEM=5=bread
            @ITEM=6=noodle
            @ITEM=7=rice
            @ITEM=-1=|
            1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
            1 4 -1 3 -1 2 3 -1 1 5 -1 -2
            5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
            5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

            In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called “apple”. The third line indicates that the item 2 is called “orange”. The 9th line indicates that the symbol “-1” must be replaced by “|”. Then the following lines define four sequences in the SPMF format.

            Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns having this format:

            apple | orange | #SUP: 4

            Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

            If you are not using the graphical interface of SPMF, but want to use strings, you can check example, this example of the documentation:

            http://www.philippe-fournier-viger.com/spmf/Using_a_TEXT_file_as_input.php

          • In my case it does not help much, but I could use ASCII codes to convert a trace to a seq of numbers. Most of the traces contain ASCII characters only I think, so for testing purposes this will be just fine. As far as I can tell the rules of your input format is the following: 1 sequence in 1 line and the items are separated by space. I guess there is no way to use itemsets instead of items, but in my case itemsets are not needed.

          • Yes, the format of the input file for the algorithms is explained in the documentation on the webpage for each algorithm. For example, to mine sequential patterns using a normal algorithms like CM-SPAM or sequential rules with RuleGrowth, the format is:

            Input file format

            The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value “-1” indicates the end of an itemset. The value “-2” indicates the end of a sequence (it appears at the end of each line). For example, the input file “contextPrefixSpan.txt” contains the following four lines (four sequences).

            1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
            1 4 -1 3 -1 2 3 -1 1 5 -1 -2
            5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
            5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

            The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

            Note that it is also possible to use a text file containing a text (several sentences) if the text file has the “.text” extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by “.”, “?” and “!”, where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

          • As far as I understand these pattern mining algorithms find subsequences or substrings. I’d like to convert my patterns to regex patterns, so parsing would be a lot easier. For example if I have the following sequences: “abbc”, “abxc”,”aybc”,”abzc”, then the “abc” will be the common subsequence with 4/4 support. But if I want to parse the sequences with “abc”, then I will get the following results: “ab(b)c”,”a(b)bc”,”ab(x)c”,”a(y)bc”,”ab(z)c”. As you can see the content of the wildcard can be either before or after the “b” item. So if I want to translate this to regex I would have 2 patterns: /ab(.*)c/”, “a(.*)bc” or we can merge them into one: /a(.*)b(.*)c/. If I expect a single capturing group always on the same place, then I have to choose between the first two regex patterns. Is there any algorithm which interprets patterns like this (so checks where the gap position is by matching), or do I have to invent some kind of converter between the subsequence patterns and the regex patterns? I could write a parser based on subsequences too, but I assume the C based regex of the actual js engine is a lot faster than writing something in pure javascript.

  1. I started reading your survey paper ( A Survey of Sequential Pattern Mining). I had a query. You stated that an item set X is a subset of items I.
    Thus an item, say A, can appear at most ‘k’ number of times in an itemset if it appears ‘k’ number of items in set I.
    However, while explaining BFS approach you took I = {a,b,c} and while enumerating 2-sequences you generated as well.
    I guess there is an error in this enumeration.

    Also while explaining definition of support of subsequence there is a mistake in its formulated definition.
    Given is:
    sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |
    I guess it should be:
    sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |

    Please clarify this query.
    Thanks.

    • Hello, Thanks for your feedback.

      Yes, there was an error in the definition of sup(sa), as you noticed. I have fixed it and re-uploaded the paper to my website:

      http://www.philippe-fournier-viger.com/dspr-paper5.pdf

      Thanks for reporting it.

      For the other question, I think that something is missing in your question. Do you want to ask why the same item can appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple in a same sequence. Let me explain this more clearly. In sequential pattern mining, an item can only appear at most one time in each itemset. But a sequence can contain multiple itemsets. Thus, it is possible that the same item appear multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For example, the sequence (a)(a) is correct because it contains two times “a” but not in the same itemset. But the sequence (a a) would be incorrect according to the definition because it contains many times the same item in the same itemset. So, to explain again, an itemset X in a sequence must be a subset of I. Thus, an item cannot appear more than once in an itemset. However, a sequence contains many itemsets. Thus, the same item is allowed to appear multiple time in the same sequence, but it must be in different itemsets.

      Hope this helps! And thanks for reporting the error.

      Best regards,

      • Yes. That was my doubt. X should be a subset of I.
        If I = {a,a,b,c} then itemset {a,a} is valid.
        However if I = {a,b,c} then {a,a} is invalid.

        I got confused because while explaining BFS based algorithms on Page 58 there is an example that takes I = {a,b,c}. And while enumerating 2-sequences contains itemsets like {a,a}, {b,b}, {c,c}.

        I wanted to point out the above fact. As I = {a,b,c}. The sequence is valid. However the sequence is not.

  2. Hello,

    I have a question about the example in this article. Is correct that the pattern has a support of 2? Because for what I understood this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there some errors in the example or I am wrong in the comprehension of Sequential Pattern?

    Thanks.

    • Hello, I would like to answer your question but I am not sure about which patterns you are talking? I think that when you submitted your question, WordPress might have removed the patterns from your comment because of the > symbols.
      If you can tell me which pattern, I will answer your question.

      • Yes, it probably thinks that they are a tags. My bad. The patters are ({b},{f,g}) and also ({f,g}) alone.

        Thanks.

        • Hi there is nothing wrong with that. Here is the point, you should look for if the dataset (the first sequence) involves the subsequence respectively. To be more clear, b is a subsequence of (a,b) in the sequence so it satisfies the rule then look for the (f,g) I believe your trouble is with b so I will not talk about the other items.

          • Yes, now it is correct. At the time there was an error. The two sequential patterns that I mentioned had a support of 2 in the old image, as reported in a comment below.

  3. Hello,

    I’m curios about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. Because I’m studying on this research area in order to adopt this concept to find sequential high yield pattern from manufacturing processes. Thus, I’m trying to search several articles including ones written by yours.
    While searching, I’m confused about the difference of the above two terms.
    In my opinion, the main difference is ‘sequential’. Since the result of some algorithms related to ‘high utility itemset’ do not consider the order of transactions, right?
    If I were right, I will able to narrow down my searching scope not ‘high utility itemset’ but ‘high utility sequential pattern’.

    Thank you, Sir.

    • Hi,
      Yes, that is correct. In high utility itemset mining, the time or sequential order is not considered.
      In high utility itemset mining, the input is a transaction database ( a set of transactions performed by many customers).
      The goal is then to find the group of items that yield a high profit when purchased together. But there is no order.
      For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple and bread makes a lot of money. But it does not tell anything about the order.
      In high utility sequential pattern mining, the order is considered.
      Here the input is a sequence database. It means a set of sequences of transactions.
      For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions that is sequentiall ordered.
      Then you can find some patterns such as < (apple),(bread)> which means that buying apple followed by buying bread yield a high profit.

      But a problem of high utility sequential pattern mining is that there is no concept of confidence.
      If you have the pattern < (apple),(bread)> , you still don’t know likely it is that if someone buy Apple, he will then buy bread.
      To solve this problem, high utility sequential rule mining was proposed.
      It is also applied on a sequence database, and will find rules such as:
      Apple –> Bread confidence : 70 % which means that this rules yield a lot of money and when someone buy Apple, 70 % of the time he will buy bread after.

      Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode mining is similar to sequential pattern mining except that patterns are mine from a single sequence instead of many sequences.

      • Thanks for your reply.
        This comment is quite helpful to understand those concepts.
        I’ve already read an article considering “USpan Algorithm”.
        I think high utility sequential pattern mining is appropriate to my research problem.
        Can you recommend some article about high utility sequential pattern mining except for “USpan Algorithm”?
        I’m trying to find a kind of ‘Top K’ mining algorithm for high utility sequential pattern mining, but that’s not easy.

        Btw, Thank you again for your kind reply.

  4. I read your paper A Survey of Sequential Pattern Mining, i dont understand the difference between closed,maximal, and generator. Could you please give me a simple review and when to used that.
    Another question, i have read paper said that contiguous sequential pattern mining is better than sequential pattern mining. Could you please give me a hint, why?
    I have search many papers but cant find the answers.
    Thank you.

  5. Hi,

    I was wondering if you could explain briefly when a sequential pattern mining method (algorithm) is robust. How to validate the robustness of an algorithm?

    Best regards,
    Cristina

    • Hello,
      In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input. How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern mining algorithm to work well with different kind of data, and to show some errors if the input data or parameter value(s) are not what the algorithm should expect. So this is software engineering 101. When you design a program, you should test it with different kind of data and parameters to make sure that it works in all situations, and make sure that it shows error message to the user when an error message should be shown.

      Regards,

    • In simple words, in frequent itemset mining, we want to find some patterns (sets of items) that appear in many transactions. But we don’t care about the time or order between these items. For example, we could find that many customers buy bread with apple. But there is no order between bread and apple. We just know that many people buy it together

      In sequential pattern mining, we also consider the order or time. Instead of finding patterns in transactions, we find patterns in some sequences. Each sequence is an ordered list of transactions. For example, we can have some sequence such that today i buy bread, tomorrow i buy milk, and after tomorrow i buy bread and milk. Thus, we have many sequences like that and we want to find the subsequences that appear in many sequences. For example, we could find the sequential pattern that many people buy milk and then after that buy bread, and then later buy coffee. Here there is some order.

      In the case where all sequences only contain one transaction, then sequential pattern mining is the same as frequent itemset mining. Thus sequential pattern mining is more general.

      I tried to give you the main idea in a simple way. If you want some more formal definition you could look at my survey papers on these topics which are easy to understand, and read the first few pages:

      Fournier-Viger, P., Lin, J. C.-W., Vo, B, Chi, T.T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, e1207 doi: 10.1002/widm.1207, 18 pages. ISI, EI, SCI, Q1
      ISSN: 1942-4787

      Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.

  6. Hi Philippe,
    Thank you very much for this useful introduction. I have the following questions:
    1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset should be a sequential pattern since it appears in SID1 and SID2.
    2. I think the support of should be 3 since it appears in SID1, SID3, and SID4 and the support of should be 3 since it appears in SID1, SID3, and SID4.

    I am now reading your survey paper on frequent sequential pattern mining to gain more knowledge on this field.
    Happy New Year.
    Dang Nguyen

    • Since some errors happened, I re-write my questions:
      1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset ({b, a}) should be a sequential pattern since it appears in SID1 and SID2.
      2. I think the support of ({b}, {f, g}) should be 3 since it appears in SID1, SID3, and SID4 and the support of ({f, g}) should be 3 since it appears in SID1, SID3, and SID4.

      • Hi Dang,

        Thanks for your comments and happy new year.

        1. Yes, ({a,b}) has a support of 2 in the above table. As you said, there is no order in the itemsets. Thus ({a,b}) is the same as ({b,a)}.
        2. Yes, that is an error in the example! I have updated the example so that ({b}, {f, g}) and ( {f, g}) have a support of 3. I will update the survey paper, which contains the same error probably tonight. Thanks for reporting this issue.

        Best,

        Philippe

  7. Is it possible to use contiguous pattern mining for mining breast cancer data. I’m trying to mine the particular gene BRCA1 or BRCA2 to find the interesting patterns in it. Research direction is correct or not? Kindly give your valuable suggestios

        • It depends on what is your goal and what is the result. You can try it. If you can find some interesting patterns that are useful to understand cancer, then it will be interesting. Ideally you could find some patterns that you would validate with a domain expert (e.g. doctor) to show that these patterns are interesting.

          Otherwise, you can work on a more technical level and work on improving the algorithms to add more features that are needed for this type of data.

  8. First of all thanks for this amazing article.
    I recently learned about SPMF and wanted to use it on my data, unfortunately I didn’t succeed to apply it properly.

    My data is a time series and look like this (DATE TIME , Name):

    2009-03-12 14:15:17 , James
    2009-03-12 14:15:19 , James
    2009-03-12 14:15:07 , James
    2009-03-12 14:17:11 , Mike
    2009-03-12 14:25:15 , Julie
    2009-03-13 14:32:39 , Thomas
    2009-03-13 14:48:46 , Mike
    2009-03-13 14:51:23 , Madona
    2009-03-13 14:59:47 , Julie

    And so on.
    I am searching for any possible pattern in this data.
    More precisely I want to know “When Julie sends a message?”.
    And if there is any relation between Julie’s and others’ behavior.

    There are also the assumptions below:
    1. I do not know if there is any relation or not.
    2. Always there might be new people joining the chat room.
    3. There are several “usual guests” who are always taking part in conversations and are very talkative.
    4. We do not know anything about these people (and their behavior). They are just a “name” for us.

    Now my question is: Is there any possibility to use SPMF to extract patterns from this time series, which could potentially be used to predict Julie’s behavior according to the others’ behavior?

    Thank you again for this wonderful blog.

    • Hi, Glad you like the blog post and the blog 😉

      Yes, this is a time-series but if you ignore the timestamps, you can also see it as a sequence and apply sequential pattern mining algorithm. The key is that if you can convert your data to the same format as some existing problem, then you can use the existing algorithms.

      In your case, you could view each day of the chat room as a sequence: a sequence for 2018-01-01, a sequence for 2018-01-02, a sequence for 2018-01-03. Then each sequence would be like (mike_join, mike_post_message, julie_join, julie_post, mike_quit, julie_quit…) Thus, you would have a sequence database with a sequence for each day, and then you could apply sequential pattern mining or sequential rule mining to find patterns common to severa days… Or maybe that a day is too long. Actually, if you use days, then the sequences will be very long and the algorithm may not work well, so perhaps that dividing by hours may work better.

      But I am not sure that pattern mining is the best technique for this problem.

      But if you convert your data to sequences, you could also try to the sequence prediction algorithm that can be trained with sequence to predice what will happen next for a new sequence.

      This is just some idea. Actually, your problem may be a bit more complex, especially if you also want to consider timestamps!

      Best,

      Philippe

      • Thanks for the detailed reply.

        I will give the “sequential pattern mining” a try and see if it provides me useful results. However, I still need to know learn what to do with the noises. For example when the internet connection of some users once in a while disconnects and users logout/login randomly.

        For the learning purpose, I still would like to know if there is any possibility to detect temporal patterns in this type of time-series.
        For example, can we somehow come up with the below (imaginary) model:

        IF (James AND [+2 minutes] Mike) THEN ([+8 minutes] Julie).
        IF (Thomas AND [+16 minutes] Mike AND [+3 minutes] Madona) THEN ([+8 minutes] Julie).

        Is there any technique which could be useful in such scenarios? Also in presence of noises?

        • Hello,

          Glad my reply is useful. In my opinion, if you consider the time, it becomes much more complicated because you then need to decide whether (James AND [+60 seconds] Mike) is the same as (James AND [+61 seconds] Mike) or (James AND [+65 seconds] Mike). If you remove the time, then they become the same pattern. But if you consider time… then either all of these patterns are considered as different (but it is probably not what you want…) or you need to find a way to define what is similar such as 60 with more or less 5 minutes is considered as similar… Most paper that I have read in sequential pattern mining cannot do that. But it does not mean that it cannot be done. You could always design a new algorithm and you add some news of handling the time with more flexibility and write a paper about it.

          An example of paper that consider the time in sequential pattern mining is:
          http://www.philippe-fournier-viger.com/spmf/hirateyamana.pdf
          However, for that paper, A +10 seconds B will be seen as different from A +11 seconds B for example…

          Best,

          • Thank you very very much for your answers.
            Now I have a better picture in mind and will follow your suggestions.

            Bests 🙂

  9. Hi Philippe,
    I have been reading your survey paper that you attached in the blog post. In the page 60 of the paper, I’ve read something like ” For example, the item-positions
    of the pattern hfagi in the sequence s2 of Table 1 is f(s2; 1); (s2; 4)g, indicating that the pattern hfagi
    appears in the first and fourth itemsts of sequence s2. “. And follow the logic of the previous sentence in the paper, I would think the statement would be ” For example, the item-positions
    of the pattern hfagi in the sequence s2 of Table 1 is f(s1; 1); (s1; 4)g, indicating that the pattern hfagi
    appears in the first and fourth itemsts of sequence s2. “. Hope you can consider them and show me if I’m wrong or not. Thanks.
    Best,

    Nam

    • Hi Nam,
      Thanks for reading the blog and the survey, and for your feedback! I have checked it and there is indeed an error in the definition.
      It should be:
      The item-positions of sa in sb, denoted as ip(sa, sb), is the set of pairs of the form (sb, ik),.
      Then, after that the example is correct. Thanks for letting me know. I have fixed the PDF on the website.
      Best regards,
      Philippe

  10. Hi Philippe,
    I have a question regarding the impact of sequence ( {a,b},{c}) differs from ({a},{b},{c}) in final sequence mining.

    • Hello,
      Yes, these two sequences are considered different.
      The sequence {a,b},{c}) means that “a” and “b” appeared at the SAME TIME, and were then followed by “c”.
      The sequence ({a},{b},{c}) means that “a” was followed by “b”, and that it was then followed by “c”.
      So those two patterns represent two different cases.
      Hope that this is clear. And thanks for reading the blog 😉
      Best,

      Philippe

  11. Pingback: Report about the DEXA 2018 and DAWAK 2018 conferences - The Data Mining BlogThe Data Mining Blog

    • Hi, Thanks for reading the blog and interesting topic. However, I don’t have time to look at your data currently. But from your description, if you have sequence of events, sequential mining can perhaps make sense. But I think it requires to check in more details the data, which I cannot do at this moment.

  12. Hello

    I don’t understand this part
    1 ≤ i1 < i2 < … < in ≤ m such that A1 ⊆ Bi1, A2 ⊆ Bi2 … An ⊆ Bin.
    In this example is contained in
    if A1 is {b}, then which one is Bi1?

    • Hi, This is about the definition of subsequences. It is the formal definition so I understand that it may not be so easy to understand.

      The idea is the following. Let’s say that you have two sequences:
      sa = A1, A2, …, An
      sb = B1, B2, …, Bm

      We want to define what it means for a sequence SA to be contained in a sequence sB.

      We say that there must exists some numbers i1, i2, in such that 1 <= i1 < i2 ... < in < m. In other words, these numbers must correspond to positions in the sequences sB (recall that the sequence sB contains m itemsets). Then, we say that sA is contained in sB if A1, A2 ... An appears at these positions. Formally: A1 ⊆ Bi1, A2 ⊆ Bi2, ..., An ⊆ Bin In English that means that all the itemsets of sA are contained in some itemsets of sB and that the sequential order between the itemsets are the same. Hope this helps.

  13. Hi!
    Dynamic specification mining is a technique to infer specifications (in form of an automaton, LTL formulas or similar properties) from execution traces.
    This seems very similar to sequential pattern mining.
    Is there a difference between (dynamic) specification mining and sequential pattern mining?

    Thank you very much!
    Martin

    • Dear Martin,
      This an interesting observation. There many types of patterns that can be extracted from temporal data such partial orders, sequential patterns, sequential rules, LTL formulas, automatas, etc. All of these patterns are extracted with the same purpose, which is to summarize the data or extract some important information from the data. The difference is mainly the definition of these pattern types. And then, because the definitions are different, the algorithm will also be different. Here is a few papers that have used my SPMF software for pattern mining in software engineering:

      Ramdhani, M. A., Maylawati, D. S. A., Amin, A. S., & Aulawi, H. (2018). Requirements Elicitation in Software Engineering. International Journal of Engineering & Technology (UEA), 7(2.19), 772-775.

      Heumüller, R., Nielebock, S., & Ortmeier, F. (2018, September). Who plays with whom?… and how? mining API interaction patterns from source code. In Proceedings of the 7th International Workshop on Software Mining (pp. 8-11). ACM.

      Schulze, C., Cleaveland, R., & Lindvall, M. (2018, June). Automated Specification Extraction and Analysis with Specstractor. In International Conference on Software Engineering and Formal Methods (pp. 37-53). Springer, Cham.

      Zaman, Tarannum Shaila, and Tingting Yu. “Extracting implicit programming rules: comparing static and dynamic approaches.” Proceedings of the 7th International Workshop on Software Mining. ACM, 2018.

      Ramadani, J. (2017). Mining software repositories for coupled changes.

      Udagawa, Y., 2016, November. Maximal frequent sequence mining for finding software clones. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services (pp. 26-33). ACM.

  14. Hello Dr. Philippe,

    Thanks for this informative blog.

    I have recently started working in the area of “Sequential Pattern Mining in Collaborative Filtering Based E-Commerce Recommendation Systems”.

    To initiate with, I started studying your paper on “A survey on Sequential Pattern Mining” which provided me very useful details. Also, I studied about different types of recommendation systems such as Content Based,Collaborative Filtering (CF) and Hybrid etc.

    I am thinking to work in the direction of :

    1. How to improve the prediction accuracy of unknown ratings in user-item
    rating matrix for Collaborative Filtering based recommendation models?
    2. How to utilize Sequential Pattern Mining in above (1) to improve the
    recommendations ? e.g by:

    i) Finding the sequence of similarities/relationship
    between items purchased for better recommendation
    ii) Finding the time gap between different purchase
    sequences to know if the items in a sequence were
    purchased long time ago, they may not be suitable
    for recommendation
    iii) Finding the weight of each sequence or items in the
    sequence which can help in prioritizing purchase
    sequences

    If you can share your opinion on this,I will be grateful.

    • Hi, It seems like an interesting topic and thanks for reading the survey paper on sequential pattern mining 😉
      About e-commerce recommendation, I did not read much on this topic. But I know that some researchers have used sequential patterns for other types of recommendation. For example, a while ago I read some people using sequential patterns for studying the places that tourists will visit in a city. In that case, a sequence is a sequence of GPS location, and there is the problem of Point-of-interest recommendation. This is obviously a different problem from e-commerce recommendation but I think that you can have a look at such problem. Maybe it can give you some inspiration for your problem. I think that it is good that you are not just considering sequential pattern mining but that you also want to add additional information such as weights, etc. If you make the problem more rich by adding this additional info, your approach will be more original and may have better performance perhaps.
      By the way, if you want to take the user profile into account, you can also check multi-dimensional sequential pattern mining (SeqDIM, DIMSeq). Those algorithms are implemented in my SPMF software. You can also try them. Regards.

  15. Thank you Dr. Philippe for your detailed response and providing perspective of multidimensional sequential pattern mining. I will look into these as well.

    Thanks & Regards

Leave a Reply

Your email address will not be published. Required fields are marked *