About the author

Philippe Fournier-Viger, Ph.D. is a data mining researcher and professor. He uses this blog as a personal website to discuss his thoughts, opinions, and ideas about data mining and research in general.

He is the founder and main author of the SPMF data mining software, an open-source library offering more than 150 algorithms for discovering itemsets, association rules, sequential patterns, and rules in sequences and transactions. The SPMF software has been cited in more than 640 papers and has received more than 700,000 visitors since 2010. He has also written or participated in more than 200 research papers, which have received more than 3,000 citations. He is one of the two editors-in-chief of the Data Science and Pattern Recognition journal. He edited the book “High-Utility Pattern Mining” (Springer).


Website:  http://www.philippe-fournier-viger.com
LinkedIn: http://www.linkedin.com/profile/view?id=63632045&trk=tab_pro
Twitter: https://twitter.com/philfv
ResearchGate: https://www.researchgate.net/profile/Philippe_Fournier-Viger
Google Scholar profile: https://scholar.google.com/citations?user=QG_7KjoAAAAJ&hl=en&oi=sra
DBLP profile: http://dblp.uni-trier.de/pers/hd/f/Fournier=Viger:Philippe



About the author — 98 Comments

  1. Please give me some research guidance. I want to work on sequential frequent pattern generation. Could you please suggest what can be done?

    • There are many possible topics.
      – make a faster algorithm, a more memory efficient algorithm
      – make an algorithm for a variation of the problem, such as mining sequential rules, mining uncertain seq. pat., mining seq. pat. in a stream, mining high-utility seq. pat., mining weighted seq. pat., mining fuzzy seq. pat., incremental mining of seq. pat., incremental mining of uncertain seq. pat., etc. You can combine several ideas to define a new algorithm, or you can make something faster or more memory-efficient.

      I think that the key is to read the recent papers and see what has been done and what has not been done. Then you will get some ideas about what you can do.

      Personally, I have a few good ideas related to these topics, but it takes time to find a good idea, so when I find one, I keep it for myself.

  2. Please guide me about data mining in blood transfusion. I want to work on blood transfusion data (the data includes HCC, HBS, and similar tests of donors).
    Please guide me.

  3. I would like to know about a data mining technique/algorithm that would be good for predicting
    a certain outcome, e.g. selecting an appropriate technique for software requirement elicitation. I need it for my master's thesis. Please tell me which algorithm is easy to use and also effective, and what language to use for the problem.

  4. Hello Sir

    I am doing my master's in computer engineering and I am interested in data mining for my thesis. I was thinking about big data, but I assume it will take a lot of time to finish or do something in this field, and it's only a master's, not a PhD, and I have 6 months to do something and publish a paper. So please suggest a specific topic on which I can do my research, and some useful references. It will be really nice of you, thanks!

    • I recommend talking with your supervisor about choosing a topic, and choosing a topic that is not too complicated. Six months is quite short.

      Perhaps you should try to find some datasets and then do some more experiments with them. Or choose an algorithm and try to improve it. I don't have too many ideas. Actually, choosing a topic can take a lot of time. Therefore, I cannot do the search for you.

  5. Please, can you recommend any good resource for exploring data mining techniques that evaluates their applications in interesting cases?
    Could I join a research group that you supervise, and how could I do so?

  6. Hello sir,
    My name is T Ramesh from India, working as an Asst. Professor in one of the engineering colleges. Recently I registered for a Ph.D. in data mining. Currently I am looking for issues in data mining. I thought of working on “privacy in social networking sites”. Would you please tell me whether it has scope in the future? Or would you please suggest any other issues to work on? I would be very thankful to you. Also, one of my friends registered for a Ph.D. and has graph mining as his area of research. Would you please suggest recent issues in graph mining to work on?

    • Hi,

      Graph mining and privacy in social networks are good and popular research areas, so I think that choosing them is a good idea. But you will still need to define your topic more precisely. I cannot help you much with that since, as I said in the blog post, it requires doing a literature review, which takes time, and I only have time to do that for my students. Just look at recent papers in good conferences and you should find out what the current challenges are.


  7. Hi sir,

    This is Manjunath from Karnataka, India. I have registered for a Ph.D. at VIT University, Vellore. I'm interested in working on social networks using graph mining. Can you please suggest recent research topics on these? I would be very thankful to you if you help me in this regard.


    • To know the recent topics, you need to read papers from the recent conferences and journals on data mining / social network mining. For example, you could have a look at papers published at ASONAM 2014.

  8. Hi Philippe,

    May I just say that your article on how to choose a good topic was excellent and useful.
    I was wondering if you could email me, as I wish to ask you a private question.

    Kind regards,

  9. Hi sir,
    Can you please tell me whether a dataset of about 30k records is enough for analysis in a master's-level thesis?

    • Even a dataset of 100 or 1,000 records can be enough. It all depends on what you are doing with the data. If you want to test the performance of an algorithm, you need more data. But if you want to do something more applied, such as predicting which students will drop out of school, having data about just a few hundred students may be enough to show that your approach works.

  10. Hello Dr., I'm preparing to embark on a PhD, and I have always thought about applying some data mining to telecommunication data (specifically customer calls).
    I'm thinking about rating/identifying the quality of service delivery by different service providers from their past customer call data, or seeing whether we can identify a pattern that could distinguish one service provider from another based on their customer call data.
    Could that be sufficient for a PhD research work?
    Thank you

    • Hi, this topic can be good. If you can get some real data, it will be more interesting. In a Ph.D., a student will generally work on a very narrow topic but will become an expert on that narrow topic and investigate it very deeply. So I don't see any problem with choosing this. But obviously, you should also do a literature review to see what other people have done with call/customer data, and to see more specifically what can be done that is original. Good luck!

  11. Hi,
    I am trying to find a PhD research problem related to discovering (retrieving information from) patterns in large-scale data (big data) for text mining. Could you please suggest what can be done here? I am totally confused about whether to improve an algorithm or just apply an algorithm to a dataset and show the results. Please guide me.

    • I think both are possible. It is possible to make a better algorithm (faster, more accurate, etc.). Or it is possible to show that an algorithm gives good results for some particular application of text mining. If you design a better algorithm, you probably need to be good at programming, and then your contribution is about algorithm design. If you choose the second option, then your contribution is for a specific application domain, and then you need to explain better how it is useful for your application, and probably also compare with alternative methods that do not use patterns to see how your approach improves the results. Or you could do both. These are just some general ideas. You actually need to do a literature review to find a good topic.

  12. Dear Sir,
    I am Yogalakshmi and I am working on a project related to sequence mining. I was exploring the SPMF framework. I have some doubts with respect to the representation of datasets for sequence mining in general. Is it possible that in all my sequences, I only have a single itemset with many items in it? For example:
    1 {A1, A2,A3,A4,A5,A6,A7,A8,A9,A10}
    2 {A4,A5,A11,A12}
    3 {A13,A8,A9,A10,A16,A17,A18,A19,A20}

    • Yes, it is possible to have a single itemset per sequence.

      However, to be considered a sequence database, items within an itemset are assumed to be simultaneous, and an item may not appear twice within the same itemset.

      Besides, if you have a single itemset per sequence, there is no notion of before or after anymore. Thus, your sequence database is actually a transaction database, and it may be better in that case to use frequent itemset mining or association rule mining algorithms rather than sequential pattern mining algorithms.
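      To make this concrete, here is a small sketch in Python (toy data of my own, not the SPMF Java API): when every sequence holds a single itemset, the data is just a transaction database, and frequent itemsets can be counted directly.

```python
from itertools import combinations
from collections import Counter

# Each "sequence" has a single itemset, so it is really just a transaction.
sequences = [
    [{"A1", "A2", "A4"}],
    [{"A4", "A5", "A1"}],
    [{"A1", "A4", "A9"}],
]
transactions = [itemset for (itemset,) in sequences]  # flatten to transactions

def frequent_itemsets(transactions, minsup):
    """Naive enumeration: count every itemset of size 1 and 2
    (enough for this toy example)."""
    counts = Counter()
    for t in transactions:
        for size in (1, 2):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= minsup}

print(frequent_itemsets(transactions, minsup=3))
# {('A1',): 3, ('A4',): 3, ('A1', 'A4'): 3}
```

      Since order between itemsets no longer exists, nothing sequential is lost by treating the data this way.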

  13. Hi sir,

    How do I select a research problem in utility mining, e.g., utility mining with negative item values? Can you give me some suggestions?
    How do I select the problem? Please give some examples and tell me what datasets are used.

    I have one doubt about utility mining: who achieves the utility, the user or the vendor?
    Please tell me some applications of utility mining and how they use it.

    • Hello,

      The traditional example of high-utility itemset mining is related to a retail store. We assume that the utility is the profit that the retail store obtains by selling products. The input is a transaction database containing many transactions made by customers. Each transaction indicates the items that a customer has bought, with the corresponding purchase quantities. For example, a transaction can indicate that a customer has bought three units of beer and two units of bread. Besides, each item has a unit profit (the amount of money that the retail store earns for each unit sold). For example, each beer may generate a $1 profit, while each bread may generate a $0.40 profit. Globally, given a database of many customer transactions, the goal is to find the groups of items that generate a high profit for the retail store.

      This is the standard definition of the problem of high-utility itemset mining. However, it is easy to apply this definition in many other domains. For example, you could apply high-utility itemset mining to analyze the webpages viewed by users. In that case, the utility could perhaps be defined in terms of the time spent on each page and the importance of each webpage. There can also be many other applications; a few of them are listed in the literature.
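      The definition above can be sketched in a few lines of Python (hypothetical data matching the beer/bread example; profits are in cents to avoid rounding issues; this is not the SPMF implementation):

```python
# Utility of an itemset = sum, over the transactions containing all its items,
# of (purchase quantity x unit profit). Profits in cents: beer = $1.00,
# bread = $0.40, as in the example above.
unit_profit = {"beer": 100, "bread": 40}

# Each transaction maps an item to its purchase quantity.
transactions = [
    {"beer": 3, "bread": 2},
    {"beer": 1},
    {"bread": 5, "beer": 2},
]

def utility(itemset, transactions):
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(unit_profit[item] * t[item] for item in itemset)
    return total

print(utility({"beer"}, transactions))           # 600 cents = $6.00
print(utility({"beer", "bread"}, transactions))  # 780 cents = $7.80
```

      A high-utility itemset is then simply one whose utility meets a user-given minimum utility threshold.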

      You can find many datasets on the SPMF library webpage: http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
      Besides, on the SPMF webpage, you can also find implementations of many of the most popular utility mining algorithms.

      Now, finding a topic requires reading some papers to know what other people are working on. Before choosing a specific topic, you need to read at least a few papers to know what other people are doing. And you should also ask your supervisor to help you find a topic. There are a lot of possibilities. For example, no paper has yet addressed the following topics:
      – discovering rare high-utility sequential rules
      – discovering uncertain high-utility sequential rules
      – discovering uncertain sequential rules (not related to high-utility)
      – etc.
      These are just a few examples that I can think of. Actually, you can generate a new topic by combining any two existing topics, such as:
      rare itemset mining + high-utility sequential rule mining = rare high-utility sequential rule mining.

  14. Hi sir,
    What is on-shelf utility mining? Can you give me some examples and applications of on-shelf utility mining?

    • On-shelf high-utility mining is similar to high-utility itemset mining, but we consider the shelf time of products (when the products are on sale and when they are not). For example, imagine that you have a retail store in a country where the weather is hot during the summer and cold during the winter. During the summer, the retail store may sell swimming suits and other products related to the beach. But during the winter, it may not sell swimming suits; it may instead sell products related to winter sports. Besides, there are also some other products, such as milk, that are sold all year round.

      In on-shelf high-utility itemset mining, we want to deal with the fact that some products are sold only during part of the year. The idea is that it is not fair to evaluate all the itemsets (or sets of products) by calculating the profit over a whole year. The reason is that some products are sold only during one or two months, while others are sold during the whole year. Thus, these products don't have the same opportunity to generate a high profit.

      The difference between high-utility itemset mining and on-shelf high-utility itemset mining is that the latter considers the time periods during which items are sold, to provide a fairer measurement of the profit of itemsets. Actually, it is a very interesting problem, and there are several possibilities to extend it.
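      The fairness idea can be illustrated with a rough Python sketch (hypothetical numbers of my choosing; actual on-shelf algorithms differ in their exact definitions):

```python
# Sketch of the "relative utility" idea in on-shelf high-utility mining.
# Each period maps an item to the profit it generated in that period.
profit_per_period = [
    {"swimsuit": 500, "milk": 100},   # summer
    {"ski": 400, "milk": 120},        # winter
]

def relative_utility(item):
    """Profit of the item divided by the total profit of only those
    periods in which the item was on the shelf."""
    on_shelf = [p for p in profit_per_period if item in p]
    item_profit = sum(p[item] for p in on_shelf)
    total = sum(sum(p.values()) for p in on_shelf)
    return item_profit / total

# "swimsuit" is judged only against the summer period's total profit,
# while "milk" is judged against the whole year's total profit.
print(relative_utility("swimsuit"))  # 500 / 600
print(relative_utility("milk"))      # 220 / 1120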

      You can read my paper about on-shelf high-utility-itemset mining here:
      It will give you all the details.

  15. Hi sir,
    I am D. Srinivasa Rao, a research scholar. Please guide me or suggest how to find a problem in utility mining. Shall I proceed with high-utility itemset mining with negative item values, or shall I proceed with on-shelf utility mining with negative item values? Please suggest whether it is better to proceed in that direction in utility mining, and if it is a good direction, how to select the problem.

    • Hi,
      I gave you a few suggestions when I replied to your other comment.

      On-shelf high-utility itemsets with negative items is an OK topic. But there already exist some algorithms for this. So in that case, you can either extend them further to do more things, or you can try to make a faster algorithm.

  16. Hi sir,
    I am an M.Tech student doing a project in the field of data mining. I am unable to find the differences among periodic patterns, popular patterns, and frequent patterns. When I read the examples of each kind of pattern, they seem different, but I am not able to differentiate them when comparing. Please give some examples of each kind of pattern and where we can apply them in the real world, and suggest some books related to pattern mining.

    • Hi,
      A periodic pattern is something that appears periodically. For example, if you analyze customer transactions in a supermarket, you may find that some people buy milk and bread with cheese every week. It is said to be periodic because it appears regularly after some period of time (e.g. every week). That is the basic idea of periodic patterns. A real-world application is to analyze customer transactions, but periodic patterns could also be applied to many other types of data, such as the electricity consumption of users, etc.

      A frequent pattern is something that appears frequently in a database. For example, if you analyze customer transactions again, you may find that people have frequently bought some items together, such as salsa sauce with tortillas and beef. Frequent pattern mining has a lot of applications besides market basket analysis. There are hundreds of applications and it would take too long to list all of them. Besides, there are many variations, such as finding frequent sequences, frequent subgraphs, etc.

      By the way, you can also combine different kinds of constraints to find patterns. For example, it would be possible to find frequent periodic patterns.
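      As a small illustration (toy data and thresholds of my choosing, not from any particular paper), one can check both properties at once: support for frequency, and the largest gap between occurrences for periodicity.

```python
# A pattern is "frequent" if its support meets minsup, and "periodic" if the
# gap between consecutive occurrences never exceeds maxper.
transactions = [
    {"milk", "bread"},         # week 1
    {"milk", "bread", "egg"},  # week 2
    {"beer"},                  # week 3
    {"milk", "bread"},         # week 4
]

def support_and_max_period(pattern):
    positions = [i for i, t in enumerate(transactions) if pattern <= t]
    support = len(positions)
    # Periods: gaps between occurrences, including the start and end of the database.
    boundaries = [-1] + positions + [len(transactions)]
    max_period = max(b - a for a, b in zip(boundaries, boundaries[1:]))
    return support, max_period

sup, per = support_and_max_period({"milk", "bread"})
print(sup, per)  # support 3, longest gap 2 (weeks 2 -> 4)
```

      With, say, minsup = 3 and maxper = 2, the pattern {milk, bread} would be both frequent and periodic in this toy database.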

      For a good book about pattern mining, you may read “Frequent Pattern Mining” by Charu Aggarwal, which gives a good overview and is quite recent.

      You may also read chapter 4 of “Introduction to Data Mining”, which is quite easy to understand, though it is a little bit old.

  17. Dear Philippe,
    I am studying sequential pattern mining from data sequences for my master's thesis. I really need the SNAKE dataset, but it is not public. Could you send me the SNAKE dataset? I am very grateful for your support.

  18. Hi sir,
    I am an M.Tech student learning about privacy-preserving utility mining for my thesis. For that, I need to extract high-utility patterns, given a transaction database and an external utility table. But I need help with the implementation of the algorithm (UP-Growth or UP-Growth+). Could you please help?

    • Hello,

      Why do you want to implement UP-Growth? UP-Growth is an old algorithm and much faster algorithms have been proposed. For example, HUI-Miner, FHM, and EFIM can be more than 1000 times faster than UP-Growth. Besides, algorithms like HUI-Miner and FHM are easier to implement than UP-Growth.

      If you want to see the source code of these algorithms, you can check the SPMF open-source data mining library which provides the source code in Java.

  19. Hi, I am trying to use your FP-Growth code for my transactions, but it's not working. Can you please help me?

  20. I was searching for a data mining journal and I found “Data Science and Pattern Recognition”.
    Sir, I have published a paper in a Springer conference, published in the LNCS series.
    Now, I want to publish an extended version of the paper. Can I publish it in your journal?

    Please reply as soon as possible.

    • Hello, yes, you are welcome to submit your extended version. I think that if your paper was published in LCNS it should be quite good. For the extended version, try to add more content, such as more references, more related work, more experiments, and in general more details. For an extended paper, there should be about 30%-40% more content than the conference paper. It is best if you can submit the paper before June to be in the next issue.

      If you need suggestions on how to extend the paper, or anything else related to submitting to our journal, just let me know. You can even send the paper to my e-mail before submission so that I can give you some advice/suggestions, if you need any.

      Currently, there are not a lot of submissions yet, so I think that you have a high probability of being accepted in the next issue 😉

      • >> I think that if your paper was published in LCNS(sic) it should be quite good.

        Thanks for your posts; they are informative.
        I wanted to know how reputed publishing in LNCS is. For instance, take the case of the ‘Graphic Recognition’ series in LNCS. If available, are there any references for this?

        • Hello, thanks for reading the blog. LNCS and LNAI are series of conference proceedings books published by Springer. Some very good conferences are published in books of the LNCS and LNAI series. But there are also conferences that are good or just average, which are also published in LNCS and LNAI. In general, however, you will never find a very weak conference in those series. LNCS and LNAI ensure some minimum level of quality, although the level is variable.

          I published many papers in LNCS and LNAI because what is good about those book series is that they are automatically indexed by several publication databases, which gives visibility to the papers. For example, the LNCS/LNAI series are indexed by EI, DBLP, SCOPUS, and several other indexes. Personally, I only publish with Springer, IEEE, or ACM for conference papers.

          IEEE is good, but one should be careful: there are the official IEEE conferences, and there are conferences that are sponsored by IEEE but not official. The latter can sometimes be very weak. With Springer, there are also some less good conferences, but I think that the level of the worst Springer conferences is better than that of the worst IEEE-sponsored conferences. Just my opinion.

          After that, you should not just look at the publisher but look at how famous a given conference is. This is actually more important than choosing the publisher.

  21. Hello Sir,

    I would like to analyze the Compact Prediction Tree (+) in depth, and I have also read your papers about it. Can you recommend any further literature?
    I also didn't understand some points regarding the training sequences, so I would like to know where to get them for a specific prediction task.

    For example, the webpage prediction task: if a user visits webpages A, B, C in that order, and we now want to predict which webpage this user is going to visit next in order to prefetch the data, on which “training sequences” are we operating so that we can make conclusions about the next sequence/item/webpage?
    I would really appreciate any help.

    Kind regards

    • Hello,
      I think that there are many things that can be improved with regard to CPT+, in particular extending it to other problems. So it is a good topic for research.

      About CPT+, there are mainly two papers:

      Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI 9078, pp. 625-636.

      Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

      Besides, you can get a 63-page PowerPoint explaining CPT here:


      Besides that, you can get the Java source code of CPT, CPT+, and the other algorithms used in these papers, as well as the datasets used in these papers, in the SPMF library:


      So, if you are not sure how it works, you can also check the code.

      2) For prediction, you need to train the model. If you have data from many users, you can use it to train a model. Then you can use the model to make predictions either for the same users or for new users.

      Another way: if you have a lot of data about a single user, you could train a model specifically for that user and use it to make predictions only for that user. Or if you have multiple users, you could create multiple CPT(+) models (one per user).

      But I think that in real life you don't necessarily have a lot of data for each user, so sometimes you may use the data about all users to train your model. It really depends on what you are doing. For example, if you use CPT+ to predict the next words that someone will type on their cellphone, then you certainly have enough data to train a model for that user, because the user may always be typing on their cellphone, and each sentence could be a sequence.

      So CPT(+) can be used in any of these ways, because in CPT(+) there is no concept of users. There is only the concept of training sequences and testing sequences (for prediction). The training sequences can come from the same user or from different users; it depends on what kind of data you have.
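      As a toy illustration of what “training sequences” mean here (this is a naive last-two-pages predictor of my own, not CPT+ itself):

```python
from collections import Counter

# Train on past page-visit sequences, then predict the next page after A, B, C
# by voting on what followed matching contexts in the training data.
training_sequences = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["B", "C", "D"],
]

def predict_next(context, sequences, window=2):
    """Predict the item that most often follows the last `window` items of context."""
    suffix = context[-window:]
    votes = Counter()
    for seq in sequences:
        for i in range(len(seq) - window):
            if seq[i:i + window] == suffix:
                votes[seq[i + window]] += 1
    return votes.most_common(1)[0][0] if votes else None

print(predict_next(["A", "B", "C"], training_sequences))  # 'D'
```

      The training sequences here could equally be the history of one user or of many users; the predictor does not care, which is the point made above.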

      Best regards,

  22. Sir, I have used your software (SPMF) and it's fantastic. A bundle of thanks for this. I just need to ask a question: I am using Apriori to generate rules on my dataset. Now I want to check (test) these rules on testing data. Is there any support in your software to do so? If so, please tell me; it will be very nice of you.

    • Hello, thanks for using SPMF. The algorithms for discovering rules will find exactly the rules that you request. For example, if you set minconf = 50%, then the algorithm will find all the rules having a confidence greater than or equal to 50%. But yes, after that you may use the rules in many different ways to “check (test)” them, and how to check whether a rule is really good depends on what you want to do with it. For example, it could be used for recommendation or other things. Thus, in SPMF, there is no such function for now. But if someone wants to implement, for example, classification using association rules, it could be added to SPMF. Or if someone wants to implement something like clustering using association rules, it could also be added. Best regards,
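      As a small illustration of how minconf filtering works (toy transactions in plain Python, not the SPMF API):

```python
# conf(X -> Y) = sup(X u Y) / sup(X): the fraction of transactions containing X
# that also contain Y.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# The rule {bread} -> {milk} has confidence 2/3, so it is kept when
# minconf = 50% but discarded when minconf = 70%.
print(confidence({"bread"}, {"milk"}))
```

      Testing rules on held-out data would simply mean recomputing these measures on the testing transactions instead of the training ones.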

  23. Dear Sir,
    I am a faculty member at a university in India. I have registered for a second Ph.D. in computer science in the broad field of data mining. I want to work on learning analytics or educational data mining. Can you suggest some topics or references for choosing my topic? I wish it to be useful to the student community and universities. I extend my appreciation to you for answering all mails with profound interest and patience. Thank you.
    Dr. K.S.Ramakrishnan

    • Dear Dr.,

      Thanks for reading and posting on the blog. Although I have worked in the field of EDM, I have not worked on this topic for about 7 years. Hence, I am not familiar with the latest research in EDM and learning analytics. I think the most important thing is to look at the recent papers published in the top conferences in this field, like Artificial Intelligence in Education, Intelligent Tutoring Systems, EC-TEL, ICALT, EDM, LAK, and others. For journals, some examples are the Journal of Artificial Intelligence in Education, IEEE Transactions on Learning Technologies, etc. You can also find recent papers on Google Scholar, and you could also check whether there are survey papers about this field from recent years. I think you need to get a good overview of recent research by reading papers from the field before choosing your topic. Preferably, I recommend reading papers from the last 5 years. Regards,

  24. Sir, I am interested in learning the FcloSM algorithm (for mining closed frequent sequences). Can you provide me a trace of the algorithm, so that I will be able to understand it better?
    Is there an open-source implementation of the algorithm? I have searched in the SPMF software, but an implementation is not available.
    Thank You.

    • Hi, I know that the paper is not so easy to understand because it is quite formal, but all the details are there. Typically, I always share the source code for my papers, but I am not the main author of that paper, nor the corresponding author. So I would recommend contacting Prof. Tin Truong Chi to ask if he would share the source code with you. You can also ask him if he has other material explaining the algorithm. That is my suggestion.

  25. Hello Sir,
    I am a PhD student from Gujarat. I am doing my research work on association rule mining in the stock market domain. I am working with a CSV file which contains information about stock prices and stock indices, but both data are encoded, i.e., the stock price is concatenated with the stock name, e.g. AXIS10.3. I want to use Apriori in SPMF, but when I try to use this file, I get a “NumberFormatException.forInputString(NumberFormatException.java:65)” error. Can you please guide me in solving this error?

  26. Hello sir,

    I want to work on top-k periodic frequent pattern mining. Could you suggest relevant papers? I am very grateful for your support.

    • Hi,
      If you want to work on itemsets, maybe the easiest is to start from my paper:

      Fournier-Viger, P., Lin, C.-W., Duong, Q.-H., Dam, T.-L., Sevcic, L., Uhrin, D., Voznak, M. (2016). PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures. Proc. 2nd Czech-China Scientific Conference 2016, Elsevier, 10 pages.

      You can get the PDF here: http://www.philippe-fournier-viger.com/PFPM_mining_periodic_patterns.pdf
      You can get the source code and some datasets here: http://www.philippe-fournier-viger.com/spmf/

      A possibility could be to transform that algorithm into a top-k algorithm. To do that you need to combine the idea of periodic pattern mining with top-k.

      Making a top-k algorithm is not very difficult. If you want to find the top-k most frequent patterns, you start from minsup = 0 and then start to search for patterns. At the same time, you keep a list of the k most frequent patterns found so far. When you have at least k patterns, you raise the minsup threshold to the lowest support among the current top-k patterns and continue to search. Then, when the algorithm terminates, you have the top-k patterns. To do this efficiently, you can use a structure like a heap or a red-black tree. I am explaining this very quickly, but you can check my webpage, for example (http://www.philippe-fournier-viger.com/publications.php). I have developed many top-k algorithms, and you can also get the source code in SPMF to see how they work.
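      The strategy above can be sketched in Python (here on single items for simplicity, using a min-heap; a real top-k pattern mining algorithm applies the same threshold-raising idea during the pattern search itself):

```python
import heapq
from collections import Counter

# Start with minsup = 0, keep the k best patterns in a min-heap, and raise
# minsup to the smallest support in the heap once it holds k patterns.
def top_k_frequent(transactions, k):
    counts = Counter(item for t in transactions for item in t)
    minsup = 0
    heap = []  # min-heap of (support, item)
    for item, sup in counts.items():
        if sup < minsup:
            continue                  # pruned by the raised threshold
        heapq.heappush(heap, (sup, item))
        if len(heap) > k:
            heapq.heappop(heap)       # drop the weakest pattern
        if len(heap) == k:
            minsup = heap[0][0]       # raise the threshold
    return sorted(heap, reverse=True)

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"b"}]
print(top_k_frequent(transactions, 2))  # [(3, 'b'), (3, 'a')]
```

      The higher minsup rises during the search, the more candidates can be pruned, which is what makes top-k mining feasible without knowing a good minsup in advance.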

      By the way, when you say top-k, you need to think: top-k in terms of what? The k most frequent periodic patterns? Or is top-k defined using some other measure?



  27. Hello, good day Philippe Fournier. I am from Peru and I have a query: would you have more information about the SSFIM algorithm? I want to use it with MapReduce to improve classification in data mining. Do you have any suggestions on this topic for research?

    • Hi Yonatan, glad to see a message from Peru. The SSFIM algorithm is a quite simple algorithm. Basically, it consists of just counting the support of all possible itemsets in a database. I think it would thus be easy to transform it to MapReduce.



      • Good day. My master's thesis topic in computing is to build a framework using SSFIM with a MapReduce approach. What would you say about this; would it be viable? What indicators could be considered for the tests? I greatly appreciate your response. I'm new to data mining topics.

        Thank you,

        • You can certainly transform SSFIM to MapReduce. Now, will it be an efficient algorithm? I don't know. You would have to compare the speed of your MapReduce version of SSFIM with other big data algorithms for frequent itemset mining. Then you could tell whether your algorithm is better or not. Although SSFIM is a new algorithm, I think it may not be so efficient. The reason is that it generates all the possible itemsets from the transactions in a database. For some datasets, this can be very slow. For example, in my data mining software SPMF, I have implemented SSFIM and some other algorithms like FP-Growth, and FP-Growth and some other algorithms can be more than 100 times faster than SSFIM. So is it a good idea to transform SSFIM to big data? I don't know. You would have to test it. But personally, I would be worried about the performance, because this algorithm has to generate all the possibilities and does not prune the search space.
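          The performance concern comes from SSFIM's core idea: generate every subset of every transaction and count it, with no pruning. A rough sketch (my own simplification, not the authors' implementation):

```python
from itertools import combinations
from collections import Counter

# Every subset of every transaction is generated and counted,
# with no pruning of the search space.
def ssfim_style_counts(transactions):
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):  # 2^|t| - 1 subsets
                counts[subset] += 1
    return counts

counts = ssfim_style_counts([{"a", "b"}, {"a", "b", "c"}])
print(counts[("a", "b")])  # 2
```

          A single transaction with 30 items already yields 2^30 - 1 subsets, which is why algorithms that prune the search space, such as FP-Growth, can be much faster.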

          • Thanks for the reply.
            I just reviewed your article on algorithms for finding frequent itemsets (http://data-mining.philippe-fournier-viger.com/classic-data-mining-algorithm-1-apriori/). My idea is to propose a MapReduce-based framework using one of the algorithms for finding frequent itemsets, and to show that the execution time is much better using MapReduce; on the other hand, I would also measure the precision for the items that are repeated many times. That is my idea for the topic; do you have any recommendations about it?

            I'm new to data mining; perhaps you have some other recommendation for a master's thesis with data mining research.
            I am very grateful for your answers. My English is not perfect; my language is Spanish. Apologies for the writing, and greetings from Cusco, Peru.

  28. Hello Professor,

    I am doing my master's in computer engineering and am interested in data mining for my thesis. I was thinking about using sequential pattern mining for choosing elective courses for undergraduate students. My questions are: do you think using anonymized real data is allowed? I ask because, in the end, I will publish the work as a paper. And lastly, do you have any suggestions for turning my research plan into a thesis topic that does not need too much time, because I have less than a year? It would be really nice of you, thanks!

    • Hi, interesting topic. 1) About using anonymized real data, you need to check your university's policy on conducting experiments with humans, or whether there is a policy about using people's data. For example, I know that some universities will not allow using data about people (anonymized or not) unless all the students sign a waiver agreeing that you can use their data. 2) I recently attended a presentation related to your topic at DEXA 2018 by Osmar Zaiane et al., called “Sequence-based Approaches to course recommender systems”. I think you should check that paper. They use a few approaches for course recommendation, including sequential patterns. But they do not use real data, due to the problem of obtaining privacy agreements.

      Best regards,

      • Thank you for the reply. If I have to generate a dataset that represents real data, with student course variables such as which courses they take each semester and what scores they get, could you suggest a method to create it?

        • Hi, there are many public e-learning datasets online. I think you should search for such datasets. For example, the PSLC DataShop has many e-learning datasets, or you can check Kaggle. Maybe they have what you need.

          Best regards

  29. Hi sir, is it correct to compare sequence prediction and pattern mining?
    And also a recommendation system with them?

    • Actually, sequence prediction is a task: you want to predict the next symbol of a sequence. There are many ways to do sequence prediction. One of these ways is to use patterns. Thus, you can compare different ways of doing sequence prediction, and pattern mining can be one of them. You could use sequential patterns or sequential rules to do sequence prediction, for example. And yes, there is also a relationship between sequence prediction and recommendation: you can think of a prediction as a recommendation. For example, if we have the sequence of books that you purchased, we can do sequence prediction to predict the next book that you will purchase. This can also be considered a recommendation.
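
    As a small illustration (not a specific SPMF algorithm), next-symbol prediction with sequential rules can be sketched like this, where each rule is a hypothetical (antecedent, consequent, confidence) tuple:

```python
def predict_next(history, rules):
    """Predict the next symbol of a sequence using sequential rules.
    `rules` is a list of (antecedent, consequent, confidence) tuples,
    a simplified stand-in for rules mined by a sequential rule miner.
    The rule whose antecedent is contained in the history and whose
    confidence is highest makes the prediction."""
    best = None
    for antecedent, consequent, confidence in rules:
        if set(antecedent).issubset(history):
            if best is None or confidence > best[1]:
                best = (consequent, confidence)
    return best[0] if best else None

rules = [(("a", "b"), "c", 0.8), (("a",), "d", 0.6)]
print(predict_next(["a", "b"], rules))  # 'c' (rule with conf 0.8 fires)
print(predict_next(["a"], rules))       # 'd' (only the second rule matches)
```

    The same function viewed as a recommender would simply return the predicted item as the recommendation.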

  30. Hello, good morning Philippe Fournier, I'm from Peru. I have a question. I read your article http://www.philippe-fournier-viger.com/Survey_Itemset_Mining.pdf, where you mention the importance of continuing research on frequent itemset mining over massive databases. I have reviewed a lot of research at IEEE, and algorithms such as Apriori, Eclat and FP-Growth keep being improved. My question is: why do people still choose the Apriori algorithm to improve its performance? Why not choose another algorithm, knowing that FP-Growth is better than Apriori?

    Apologies for my English; I am not fluent.

    • Good morning,

      Thanks for reading and commenting on the blog. Yes, that is a good question. I have also noticed that many new algorithms for new pattern mining problems are still based on Apriori. Technically, such papers can be accepted because if a paper presents an Apriori-based algorithm for a new pattern mining problem, then it is the first algorithm for that problem, and it cannot be compared with other algorithms.

      But it is obvious that an FP-Growth-based algorithm will generally be faster than an Apriori-based algorithm. So why do people still choose Apriori? There are two main reasons. First, Apriori is quite simple to implement compared to faster algorithms like FPGrowth. I would say that the difficulty of programming Apriori is 2 on a five-point scale, while FPGrowth is perhaps 4 out of 5. Second, some authors choose Apriori because they want to write an algorithm that is not fast, so they can later write another paper with a faster algorithm based on FPGrowth, for example. This is a way of producing many papers with small improvements. Many people do this, but I don't think it is a good idea.

      Then there are some algorithms like Eclat that are quite easy to implement, and are faster than Apriori but slower than FPGrowth. Thus, many people are currently designing algorithms based on Eclat. I think this is somewhat reasonable, because performance is always important, but making an algorithm that is not too hard to implement can also be an important criterion in some cases. Actually, many researchers in pattern mining have focused on performance, but I think we should put more focus on finding patterns that are really interesting or useful. In fact, many papers evaluate algorithms only based on time and memory, and ignore whether the patterns found are useful for the user.
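
      For reference, the level-wise candidate generation that makes Apriori simple to program can be sketched in a few lines of Python (a minimal illustration, not an optimized implementation):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: generate candidate itemsets level by
    level and prune any candidate whose support (absolute count of
    containing transactions) is below minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # level 1: frequent single items
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        # join step: combine frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # prune step: keep only candidates meeting minsup
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(sorted(s) for s in apriori(db, 2)))
# [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]
```

      The repeated database scans in `support()` are exactly what FP-Growth avoids with its compact tree structure, which is where its speed advantage comes from.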

      Best regards,


      • Good morning Professor Philippe,
        If I do research that proposes an improvement to the Apriori algorithm, how do I choose which algorithms to compare it against in the experiments?
        Many studies compare the Apriori algorithm with Eclat, MRApriori, Apriori on Spark, and other improved algorithms.

        My question is: how do I choose the algorithms against which to compare the runtime of the proposed (improved) Apriori algorithm?

        Thank you.

        • Hello, I think that improving the Apriori algorithm is not a good research topic. The reason is that Apriori is very slow, and some algorithms like FPGrowth can perhaps be 1000 times faster. Thus, even if you make Apriori twice as fast, its performance will still not be good compared to newer algorithms like FPGrowth, LCM, etc. I think the only reasonable way of working on Apriori is if you define a new pattern mining problem. Then, adapting Apriori can be justified, because there are no other algorithms for your new problem. In other words, if you design an algorithm for a new pattern mining problem, the algorithm does not need to be very efficient and cannot be compared. But if your goal is to improve the Apriori algorithm for frequent itemset mining (an existing problem), then you would have to compare your improved Apriori algorithm with the BEST frequent itemset mining algorithms, and most likely Apriori-based algorithms cannot beat those algorithms.

          > My question is how do I choose which algorithms should I compare the time performance with respect to the proposed (improved) APRIORI algorithm?

          Generally, you should always compare with the best algorithms published so far for the problem that you are interested in. If you design an algorithm for frequent itemset mining, then you should compare with the best frequent itemset mining algorithms. If you design an algorithm for uncertain itemset mining, then you should compare with the best algorithm for that problem. And if you define a new problem, then maybe you cannot compare with anything else, because you are the first one to address it.

          Hope this is clear

          Best regards

  31. My Ph.D. is in time series data mining. I do classification on ECG data and on simulated time series from autoregressive and bilinear models, using motif discovery in R and the SAX approximation.
    Is this research direction new and good?

  32. Hello Philippe Fournier,

    First of all, I want to thank you for your very good overview of the sequential pattern mining topic (A Survey of Sequential Pattern Mining).

    I'm currently working on my bachelor thesis about the use of sequential pattern / sequential rule mining techniques for automotive predictive maintenance.

    I want to answer the research question: what amount of data is necessary to extract meaningful / significant sequential patterns / rules from a sequence database?

    I don't have access to real-life automotive predictive maintenance data, so I thought about taking a webpage click-stream dataset.
    My plan is to take a sequential dataset and then gradually shrink it.

    Let's assume that with a dataset of 1000 sequences it is possible to extract 50 sequential rules with confidence 0.8; with half of the data (500 sequences) it is only possible to extract 35 rules with confidence 0.8; with 50 sequences, only 10 rules; and so on.
    I think I will use the SPMF library and only change the size and variability of the dataset.

    Is my reasoning correct, or are my thoughts wrong?
    Have you already published something on this subject, or can you recommend good papers / books?

    I would very much appreciate an answer,

    Best regards

    • Hi,

      Thanks for reading the survey. It is an interesting topic.

      If you don’t have data, maybe the best option is to use something as close as possible to your topic. For example, if there is no data about automobile maintenance, maybe you can still get data about airplane maintenance, or something else. But if none of that is available, then you could use other data, like the click-stream dataset you mentioned.

      Yes, you could change the size of the data and observe the number of rules. But as you do that, note that the number of rules may also increase as you decrease the number of sequences, although most likely it will generally decrease. The number of rules could increase if you use a minimum support expressed as a percentage and, as you remove sequences, the remaining sequences become very similar, thus containing perhaps more patterns.

      Another thing to consider is to look at the support of the rules and think about whether it is high enough to draw significant conclusions. For example, if you decrease to 50 sequences and find 10 rules that each appear in only 2 sequences, a support of 2 may not allow you to draw any significant conclusion; the rules could just be noise in your data. So this is another problem when you don't have enough data: the patterns may not be reliable because their support is too low. There are some approaches in pattern mining, such as the Skopus algorithm in SPMF, that are designed to find sequential patterns that are statistically significant. With that algorithm, if the number of sequences is too low, no significant patterns will be found. But for sequential rules, no statistical test is done.

      Thus, I think you could look at the number of rules, but also at their average support value, and perhaps the minimum and maximum as well. You could also set a threshold so that rules with a support below some given value are ignored, because they would probably not be significant.
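
      This filtering idea can be sketched as follows, assuming the rule miner's output has been simplified to (rule, support) pairs (a hypothetical format for illustration):

```python
def rule_support_stats(rules, min_significant_support=5):
    """Summarize the support of mined rules and filter out rules whose
    support is too low to be significant. `rules` is a list of
    (rule, support) pairs, a simplified stand-in for the output of a
    sequential rule mining algorithm."""
    kept = [(r, s) for r, s in rules if s >= min_significant_support]
    supports = [s for _, s in rules]
    stats = {
        "count": len(rules),
        "kept": len(kept),
        "min": min(supports),
        "max": max(supports),
        "avg": sum(supports) / len(supports),
    }
    return kept, stats

rules = [("A->B", 12), ("B->C", 2), ("A->C", 7)]
kept, stats = rule_support_stats(rules)
print(stats["kept"], stats["avg"])  # 2 7.0 (the support-2 rule is dropped)
```

      Running this on each shrunken version of the dataset would let you report not just the rule count but how the support distribution degrades.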

      I have not published anything on that. I think it is an interesting question to answer, but I do not have direct paper recommendations about it.

      Your problem may also be related to papers about sampling. Some papers evaluate how many transactions or sequences we need to sample from a big database to still find the interesting patterns. This may be something to check. I think I saw some papers about that for itemsets or association rules, and a similar idea could perhaps be applied to sequential rules.

      Best regards

      • Thank you very much for your detailed response!!! That helped me a lot.
        I will now try to work on it and will contact you again if necessary.

  33. Hi Philippe,

    short question:
    Is there a difference between a subsequence and a sequential pattern in the context of SPM?
    In my opinion, there is one.
    A subsequence could represent a sequential pattern if it satisfies some user-specified support and confidence values.

    So you could also say that a subsequence that satisfies min_sup and min_conf is interesting for the user, and is then called a sequential pattern.

    Am I right?

    Thank you in advance.

    Best regards


    • Hi Dominik,

      Yes, I agree with you. You can think of sequential patterns as the subsequences that meet some constraints set by the user. In several papers, researchers use that terminology, while in other papers, authors say for example “frequent sequential patterns” to refer to the sequential patterns (subsequences) that meet the frequency constraint. In the latter case, the meaning is subsequence = sequential pattern.
      I think both usages are OK. But the most important thing when writing a paper is to be clear and consistent in how the terms are used!
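
      For clarity, the containment test behind this definition (items appear in order, gaps allowed) can be sketched as follows, simplified to sequences of single items rather than itemsets:

```python
def is_subsequence(pattern, sequence):
    """Check whether `pattern` occurs as a subsequence of `sequence`,
    i.e. all its items appear in the same order but not necessarily
    consecutively (gaps are allowed). This is the containment test
    behind the support count in sequential pattern mining, simplified
    to single items per position."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced
    return all(item in it for item in pattern)

print(is_subsequence("ACE", "ABCDE"))  # True (B and D are skipped)
print(is_subsequence("AEC", "ABCDE"))  # False (order matters)
```

      A sequential pattern is then just a subsequence whose number of containing sequences meets the user's minimum support.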

      Best regards,

  34. Dear Sir,

    First of all, I wanted to thank you for all the work that you make available to us.
    I am currently working on algorithms for sequential pattern mining and rule mining, and I have a question about the algorithms provided by your open-source data mining library SPMF.
    Which algorithms are considered deterministic, please?

    Yours truly

    • Hi,
      I am glad that the software is useful. I would say that almost all algorithms in SPMF are deterministic, but a few of them are not, such as:
      – k-means
      – all the algorithms that contain PSO (particle swarm optimization) or GA (genetic algorithm) in their names, or are called evolutionary algorithms. This includes: HUIM-GA, HUIM-BPSO, HUIM-BPSO-tree, HUIF-BA, etc.
      – …

      I think that is mostly those algorithms.

      Best regards,


  35. Hello sir ,
    This is Rajshekar Deva, research scholar. Can you please suggest some ideas on vertical association rule mining? I am facing difficulties in finding a topic in that area!

    • Hi, thanks for reading the blog. It takes some time to find a good topic. There are several papers on association rules. Why not consider sequential rules instead? It is similar, but there is much less work, so there are more possibilities to explore new topics. You can look at topics studied for association rules that do not yet exist for sequential rules, and try one of them. Regards,

    • Hello, it means a sequential pattern that has no gap (we cannot skip any items).

      For example, if you have three sequences:
      Sequence 1: ABCDE (which means A followed by B, followed by C, followed by D, followed by E)
      Sequence 2: ABGCD
      Sequence 3: ABGCD

      The pattern AB is a contiguous pattern in Sequences 1, 2 and 3 because A appears immediately before B in each of those sequences.

      But the pattern ABD is NOT a contiguous pattern in Sequences 1, 2 and 3 because A, B and D do not appear consecutively in these sequences (there is a gap; some items must be skipped).

      Hope this is clear.
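
      The difference from ordinary subsequence matching can be shown with a small sketch that tests contiguous containment (no gaps allowed):

```python
def is_contiguous(pattern, sequence):
    """Check whether `pattern` appears as a contiguous subsequence of
    `sequence`: its items must occur strictly consecutively, with no
    gap, unlike ordinary subsequence containment."""
    n = len(pattern)
    # slide a window of length n over the sequence
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))

print(is_contiguous("AB", "ABCDE"))   # True: A is immediately before B
print(is_contiguous("ABD", "ABCDE"))  # False: C sits between B and D
```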

  36. Hello sir ,
    This is Erna from Indonesia. I am a student in a doctoral program, and I am interested in research on association rules. I would like to find a method to automatically determine the minimum support value for association rule mining, so that users do not have to define it. What is your opinion about this? Can you please give me advice for my research? Thank you.

    • Hi,
      Welcome to this blog and thanks for reading. Yes, setting the minimum support threshold or other parameters of algorithms is a problem, because it is not very intuitive.
      There have been some papers proposing approaches to solve this problem:
      – top-n itemset mining or top-k pattern mining (they are about the same): in this problem, the user sets the number of patterns to be found, called “k” or “n”. The algorithm then automatically adjusts the minsup threshold to ensure that “k” patterns are found. There are various algorithms for top-k pattern mining.
      – skyline pattern mining: in this problem, the user does not need to set any parameter. The goal is to find patterns that dominate the other patterns. You can search for the papers about this.

      Those are the research directions that I think are most relevant to your topic. But you may find other ways to address this problem, or define a new problem. Anything is possible 😉
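
      As a naive illustration of the top-k idea (real top-k algorithms instead raise an internal minsup threshold dynamically to prune the search space, rather than counting everything), one could write:

```python
from collections import Counter
from itertools import combinations

def top_k_itemsets(transactions, k):
    """Naive top-k itemset mining sketch: count every itemset and keep
    the k with the highest support, so the user specifies k instead of
    a minimum support threshold. For illustration only; it enumerates
    exhaustively and does not scale."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts.most_common(k)

db = [["a", "b"], ["a", "b"], ["a", "c"]]
print(top_k_itemsets(db, 2)[0])  # (('a',), 3): 'a' has the top support
```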


  37. Thank you, sir, for your suggestion. I have already read some papers about minimum support, such as multiple minimum supports, high-utility itemsets, etc. I will now study papers about skyline pattern mining and top-n itemset mining / top-k pattern mining. Thank you for your response, and maybe I will contact you again.

  38. Hi Philippe,
    Just wanted to check with you: did you receive my email about Trans2Vec? I replied to you via your Yahoo email.
    Dang Nguyen

  39. Hi Philippe,
    Thank you for sharing your knowledge; I learn a lot from your blog.
    Could you please help me with one question?
    How can we validate the association rules found by the Apriori algorithm?

    • Hi Yonatan,
      Glad the blog is useful 🙂

      I think it depends on the application. Let's say that you find association rules in medical data. To validate the rules, one way could be to show them to a domain expert (e.g. a doctor) and ask what they think of them: Are they surprising? Do they make sense for medical data? Would they be useful? This is validation from the perspective of an application. I use this a lot in my research papers about pattern mining: if I propose a new type of rule, I will try to show that, on real data, I can find patterns that would be useful or make sense from the perspective of some application.

      Another way could be to evaluate how useful these rules are for doing something specific, like making predictions. If you can use the rules, for example, to predict who will get sick, then in that context you could calculate measures like accuracy and precision to see if your rules are good for prediction.

      But in association rule mining, you do not really need to use the rules for prediction. You can also just use association rule mining to learn some knowledge (patterns) from the data. In fact, association rule mining is a type of unsupervised learning: when you look for association rules, you do not really know exactly what you will find until you find it.
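
      The prediction-based validation mentioned above can be sketched as follows, with rules as hypothetical (antecedent, consequent, confidence) tuples:

```python
def rule_prediction_accuracy(rules, test_cases):
    """Validate association rules by using them for prediction: for
    each test case (observed_items, true_outcome), fire the matching
    rule with the highest confidence and check whether its consequent
    equals the true outcome. `rules` is a list of hypothetical
    (antecedent, consequent, confidence) tuples."""
    correct = 0
    for observed, truth in test_cases:
        matching = [(c, conf) for a, c, conf in rules
                    if set(a).issubset(observed)]
        if matching:
            prediction = max(matching, key=lambda x: x[1])[0]
            if prediction == truth:
                correct += 1
    return correct / len(test_cases)

rules = [(("fever", "cough"), "flu", 0.9), (("fever",), "cold", 0.5)]
cases = [({"fever", "cough"}, "flu"), ({"fever"}, "flu")]
print(rule_prediction_accuracy(rules, cases))  # 0.5
```

      This kind of score only makes sense when there is a prediction task; for pure knowledge discovery, expert review of the rules remains the more appropriate form of validation.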

      That is what comes to my mind about evaluating the association rules.

      Best regards,
