About the author

Philippe Fournier-Viger

Welcome to my blog!

I am Philippe Fournier-Viger, Ph.D., a researcher and professor working on data mining, artificial intelligence and their applications. I use this blog as a personal website to share my thoughts, opinions and ideas about data mining and research in general.

I am the founder and main author of the SPMF data mining software, an open-source library specialized in pattern mining that offers more than 250 algorithms. It covers many tasks, such as discovering frequent itemsets, high utility itemsets, episodes, association rules, sequential patterns and sequential rules in sequences, graphs, and transactions.

The SPMF software has been cited in more than 1000 papers and its website has received more than 1,000,000 visitors since 2010. I have participated in more than 350 research articles, which have received more than 11,400 citations. I am editor-in-chief of the Data Science and Pattern Recognition journal, and former associate editor-in-chief of the Applied Intelligence journal (Springer, SCI Q2). I have edited or co-edited several books such as “High-Utility Pattern Mining” (Springer). I have been a keynote speaker at more than 15 international conferences such as MIWAI 2019, ICGEC 2018, CCNS 2020, IAECST 2000, ICAICE 2020 and IFID 2020.


61 Responses to About the author

  1. Aasha.M says:

    Please help me find a good research topic on using data mining in healthcare.

  2. aakansha Saxena says:

    Please give me some research guidance. I want to work on sequential frequent pattern generation. Could you please suggest what can be done?

    • There are many possible topics.
      – make a faster or more memory-efficient algorithm
      – make an algorithm for a variation of the problem, such as mining sequential rules, uncertain sequential patterns, sequential patterns in a stream, high-utility sequential patterns, weighted sequential patterns or fuzzy sequential patterns, or incremental mining of sequential patterns or of uncertain sequential patterns, etc. I mean that you can combine several ideas and define a new algorithm. Or you can make something faster or more memory efficient.

      I think that the key is to read the recent papers and to see what has been done and what has not been done. Then you will get some ideas about what you can do.

      I have a few good ideas related to these topics, but it takes time to find a good idea, so when I find one, I keep it for myself.

  3. eli says:

    Please guide me about data mining in blood transfusion. I want to work on blood transfusion data (the data include donor tests such as hcc, hbs and the like).
    Please guide me.

  4. nasrin says:

    I would like to know which data mining technique/algorithm would be better for predicting
    a certain outcome, e.g. selecting an appropriate technique for software requirement elicitation. I need it for my master's thesis. Please tell me which algorithm is easy to use and also effective, and what language to use for the problem.

  5. ahmad says:

    Hello Sir,

    I am doing my master's in computer engineering and am interested in data mining for my thesis. I was thinking about big data, but I assume it will take a lot of time to finish or do something in this field. It is only a master's, not a PhD, and I have 6 months to do something and publish a paper. So I am thinking of.. So please suggest a specific topic on which I can do my research, and some useful references. It would be really nice of you, thanks!

    • I recommend talking with your supervisor about choosing a topic, and choosing one that is not too complicated. Six months is quite short.

      Perhaps you could find some datasets and run more experiments on them. Or choose an algorithm and try to improve it. I don’t have many ideas. Actually, choosing a topic can take a lot of time, so I cannot do the search for you.

  6. Leonid Aleman Gonzales says:

    Please, can you recommend a good resource for exploring data mining techniques that evaluates their applications in interesting cases?
    Could I join a research group that you lead, and how can I do so?

  7. T Ramesh says:

    Hello sir,
    My name is T Ramesh from India, working as an Asst. Professor in one of the engineering colleges. Recently I registered for a Ph.D. in data mining. Currently I am looking for issues in data mining. I thought of working on “Privacy in social networking sites”. Could you please tell me whether it has scope in the future or not? Or could you please suggest any other issues to work on? I would be very thankful to you. Also, one of my friends registered for a Ph.D. and has graph mining as his area of research. Would you please suggest recent issues in graph mining to work on?

    • Hi,

      Graph mining and privacy in social networks are good and popular research areas, so I think that choosing them is a good idea. But you will still need to define your topic more precisely. I cannot help you much with that since, as I said in the blog post, it requires doing a literature review, which takes time, and I only have time to do that for my students. Just look at recent papers in good conferences and you should find out what the current challenges are.


  8. Manjunatha H C says:

    Hi sir,

    This is Manjunath from Karnataka, India. I have registered for a Ph.D. at VIT University, Vellore. I’m interested in working on social networks using graph mining. Can you please suggest recent research topics on these? I would be very thankful if you could help me in this regard.


    • To know the recent topics, you need to read papers from the recent conferences and journals on data mining / social network mining. For example, you could have a look at papers published at ASONAM 2014.

  9. Kareem says:

    Hi Philippe,

    May I just say that your article on how to choose a good topic was excellent and useful.
    I was wondering if you could email me, as I wish to ask you a private question.

    Kind regards,

  10. Waqas says:

    Hi Sir,
    can you please tell me whether a dataset of about 30k is enough for the analysis in a master's-level thesis?

    • Even a dataset of 100 or 1k can be enough. It all depends on what you are doing with the data. If you want to test the performance of an algorithm, you need more data. But if you want to do something more applied, such as predicting which students will drop out of school, having data about just a few hundred students may be enough to show that your approach works.

  11. Umar says:

    Hello Dr. I’m preparing to embark on a PhD, and I have always thought about applying data mining to telecommunication data (specifically customer calls).
    I’m thinking about rating/identifying the quality of service delivered by different service providers from their past customer call data, or seeing whether we can identify a pattern in the customer calls that distinguishes one service provider from another.
    Could that be sufficient for PhD research work?
    Thank you

    • Hi, this topic can be good. If you can get some real data, it will be more interesting. In a Ph.D., a student will generally work on a very narrow topic but will become an expert on this narrow topic and investigate it very deeply. So, I don’t see any problem with choosing this. But obviously, you should also do a literature review to see what other people have done with call/customer data, and to see more specifically what can be done that is original. Good luck!

  12. Ramani says:

    I am trying to find a PhD research problem related to discovering patterns (retrieving information) from large-scale data (big data) for text mining. Could you please suggest what can be done in this area? I am totally confused about whether to improve an algorithm or just apply the algorithm to a dataset and show the results. Please guide me.

    • I think both are possible. It is possible to make a better algorithm (faster, more accurate, etc.). Or it is possible to show that the algorithm gives good results for some particular application of text mining. If you design a better algorithm, you probably need to be better at programming, and your contribution is then about algorithm design. If you choose the second option, your contribution is for a specific application domain, and you then need to better explain how it is useful for your application, and probably also compare with alternative methods that do not use patterns to see how your approach improves the results. Or you could do both. These are just some general ideas. You actually need to do a literature review to find a good topic.

  13. Yogalakshmi J says:

    Dear Sir,
    I am Yogalakshmi and I am working on a project related to sequence mining. I was exploring the SPMF framework. I have some doubts about the representation of datasets for sequence mining in general. Is it possible that in all my sequences, I only have a single itemset with many items in it? For example:
    1 {A1, A2,A3,A4,A5,A6,A7,A8,A9,A10}
    2 {A4,A5,A11,A12}
    3 {A13,A8,A9,A10,A16,A17,A18,A19,A20}

    • Yes, it is possible to have a single itemset per sequence.

      However, to be considered a sequence database, items within an itemset are assumed to be simultaneous, and an item may not appear twice within the same itemset.

      Besides, if you have a single itemset per sequence, there is no notion of before or after anymore. Thus, your sequence database is actually a transaction database, and it may be better in that case to use frequent itemset mining or association rule mining algorithms rather than sequential pattern mining algorithms.
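
      To make this concrete, here is a minimal Python sketch. It is only an illustration: the functions as_transactions and support are hypothetical names and not part of SPMF. It checks whether every sequence holds a single itemset and, if so, treats the database as a transaction database, on which itemset support can be counted.

```python
# Illustrative sketch only; the function names are hypothetical, not SPMF's API.

def as_transactions(sequence_db):
    """Return the database as a list of transactions (sets of items) if every
    sequence holds exactly one itemset; otherwise return None."""
    if all(len(sequence) == 1 for sequence in sequence_db):
        return [set(sequence[0]) for sequence in sequence_db]
    return None


def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


# The three sequences from the question: each one has a single itemset.
sequence_db = [
    [["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"]],
    [["A4", "A5", "A11", "A12"]],
    [["A13", "A8", "A9", "A10", "A16", "A17", "A18", "A19", "A20"]],
]

transactions = as_transactions(sequence_db)
print(support({"A4", "A5"}, transactions))
print(support({"A8", "A9", "A10"}, transactions))
```

      Since no order information is left, counting itemsets over these sets loses nothing compared with treating them as sequences.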


    Hi sir,

    How should I select a research problem in utility mining? Please give me some suggestions, sir, for example about utility mining with negative item values.
    How should I select the problem? Please give some examples and tell me what datasets are used.

    I have one doubt, sir, about utility mining: who obtains the utility, the user or the vendor?
    Please tell me some applications of utility mining and how they use it.

    • Hello,

      The traditional example of high-utility itemset mining is related to a retail store. We assume that the utility is the profit that the retail store obtains by selling products. The input is a transaction database, containing many transactions made by customers. Each transaction indicates the items that a customer has bought, with the corresponding purchase quantities. For example, a transaction can indicate that a customer has bought three units of beer and two units of bread. Besides, each item has a unit profit (the amount of money that the retail store earns for each unit sold). For example, each beer may generate a $1 profit, while each bread may generate a $0.40 profit. Globally, given a database with many customer transactions, the goal is to find the groups of items that generate a high profit for the retail store.

      This is the standard definition of the problem of high-utility itemset mining. However, it is easy to apply this definition in many other domains. For example, you could apply high-utility itemset mining to analyse the webpages viewed by users. In that case the utility could perhaps be defined in terms of the time spent on each page and the importance of each webpage. There can also be many other applications. A few of them are listed in the literature.
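
      The retail store example can be sketched in a few lines of Python. This is only a brute-force illustration under my own assumptions: the item "milk", the extra transactions and the threshold are made up, profits are stored in cents to avoid rounding, and real algorithms prune the search space instead of enumerating every itemset.

```python
from itertools import combinations

# Unit profits in cents. Beer and bread come from the example in the text;
# "milk" and its profit are hypothetical values added for illustration.
UNIT_PROFIT = {"beer": 100, "bread": 40, "milk": 20}

# Each transaction maps an item to the purchased quantity.
transactions = [
    {"beer": 3, "bread": 2},            # the transaction described in the text
    {"beer": 1, "milk": 4},             # hypothetical
    {"bread": 5, "milk": 1},            # hypothetical
    {"beer": 2, "bread": 1, "milk": 2}, # hypothetical
]


def utility(itemset, transaction):
    """Profit generated by the itemset in one transaction (0 if it is absent)."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] * UNIT_PROFIT[item] for item in itemset)


def total_utility(itemset, db):
    """Profit generated by the itemset over the whole database."""
    return sum(utility(itemset, t) for t in db)


def high_utility_itemsets(db, min_utility):
    """Brute force: test every itemset and keep those reaching min_utility."""
    items = sorted({item for t in db for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = total_utility(itemset, db)
            if u >= min_utility:
                result[itemset] = u
    return result


print(utility(("beer", "bread"), transactions[0]))  # 3 * 100 + 2 * 40 cents
print(high_utility_itemsets(transactions, min_utility=500))
```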

      You can find many datasets on the SPMF library webpage: http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
      Besides, on the SPMF webpage, you can also find implementations of many of the most popular utility mining algorithms.

      Now, before choosing a specific topic, you need to read at least a few papers to know what other people are working on. You should also ask your supervisor to help you find a topic. There are a lot of possibilities. For example, no paper has yet addressed the following topics:
      – discovering rare high-utility sequential rules
      – discovering uncertain high-utility sequential rules
      – discovering uncertain sequential rules (not related to high-utility)
      – etc.
      These are just a few examples that I can think of. Actually, you can generate a new topic by combining any two existing topics, such as:
      rare itemset mining + high-utility sequential rule mining = rare high-utility sequential rule mining.


    Hi sir,
    what is on-shelf utility mining? Please give me some examples and applications of on-shelf utility mining.

    • On-shelf high utility mining is similar to high utility itemset mining, but we consider the shelf time of products (when the products are on sale and when they are not). For example, imagine that you have a retail store in a country where the weather is hot during the summer and cold during the winter. During the summer, the retail store may sell swimsuits and other products related to the beach. But during the winter, the retail store may not sell swimsuits. It may instead sell products related to winter sports. Besides, some other products, such as milk, are sold all year round.

      In on-shelf high-utility itemset mining we want to deal with the information that some products are sold only during some part of the year. The idea is that it is not fair to evaluate all the itemsets (or sets of products) by calculating the profit over a year. The reason is that some products are sold only during 1 month, 2 months… while other products are sold during the whole year. Thus, these products don’t have the same opportunity to generate a high profit.

      The difference between high-utility itemset mining and on-shelf high utility itemset mining is that the latter considers the time periods during which items are sold to provide a fairer measurement of the profit of itemsets. Actually, it is a very interesting problem. And there are several possibilities to extend that.
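
      A small Python sketch may help to see why the shelf time matters. It is based on assumptions made for this illustration: an item is considered on the shelf during a period if it is sold at least once in that period, and the profit of an itemset is divided by the total profit of all transactions in the periods where the itemset is on the shelf (a relative utility). The data and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical unit profits and transactions, tagged with a sales period.
UNIT_PROFIT = {"swimsuit": 10, "skis": 50, "milk": 1}

transactions = [
    ("summer", {"swimsuit": 2, "milk": 3}),
    ("summer", {"milk": 5}),
    ("winter", {"skis": 1, "milk": 2}),
    ("winter", {"milk": 8}),
    ("winter", {"milk": 10}),
]


def transaction_utility(items):
    """Total profit of one transaction."""
    return sum(qty * UNIT_PROFIT[item] for item, qty in items.items())


def utility(itemset, db):
    """Profit of the itemset over all transactions that contain it."""
    return sum(
        sum(items[item] * UNIT_PROFIT[item] for item in itemset)
        for _, items in db
        if all(item in items for item in itemset)
    )


def on_shelf_periods(itemset, db):
    """Periods in which every item of the itemset is sold at least once
    (an assumption standing in for real shelf-time information)."""
    periods = None
    for item in itemset:
        sold_in = {period for period, items in db if item in items}
        periods = sold_in if periods is None else periods & sold_in
    return periods or set()


def relative_utility(itemset, db):
    """Profit of the itemset divided by the total profit of its on-shelf periods."""
    periods = on_shelf_periods(itemset, db)
    total = sum(transaction_utility(items) for period, items in db if period in periods)
    return Fraction(utility(itemset, db), total) if total else Fraction(0)


for itemset in [("swimsuit",), ("skis",), ("milk",)]:
    print(itemset, utility(itemset, transactions), relative_utility(itemset, transactions))
```

      In this toy data, milk generates more profit than swimsuits over the whole year, but swimsuits account for a larger share of the profit of the period in which they are actually sold.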

      You can read my paper about on-shelf high-utility-itemset mining here:
      It will give you all the details.


    Hi sir,
    I am D. Srinivasa Rao, a research scholar. Please guide me or suggest how to find a problem in utility mining. Shall I proceed with high utility itemset mining with negative item values, or with on-shelf utility mining with negative item values? Please tell me whether this is a good direction in utility mining. If it is, please give me some suggestions on how to select the problem.

    • Hi,
      I gave you a few suggestions when I replied to your other comment.

      On-shelf high-utility itemset mining with negative item values is an OK topic. But some algorithms already exist for this. So either you can extend them to do more things, or you can try to make a faster algorithm.

  17. Revathi says:

    Hi Sir,
    I am an M.Tech student. I am doing a project in the field of data mining. I am unable to find the differences among periodic patterns, popular patterns and frequent patterns. When I read the examples of each kind of pattern, they seem to be different, but I am not able to differentiate them when comparing. Please give some examples of each kind of pattern and of where we can apply these patterns in the real world. Please also suggest some books related to pattern mining.

    • Hi,
      A periodic pattern is something that appears periodically. For example, if you analyse customer transactions in a supermarket, you may find that some people buy milk and bread with cheese every week. It is said to be periodic because it appears regularly after some period of time (e.g. every week). That is the basic idea of periodic patterns. One real-world application is to analyze customer transactions. But it could also be applied to many other types of data, such as the electricity consumption of users, etc.

      A frequent pattern is something that appears often in a database, regardless of when. For example, if you again analyze customer transactions, you may find that people have frequently bought some items together, such as salsa sauce with tortillas and beef. Frequent pattern mining has a lot of applications besides market basket analysis. There are hundreds of applications and it would take too long to list all of them. Besides, there are many variations such as finding frequent sequences, frequent sub-graphs, etc.

      By the way, you can also combine different kinds of constraints to find patterns. For example, it would be possible to find frequent periodic patterns.
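
      The difference can be illustrated with a short Python sketch on made-up data. The names are hypothetical, and I assume one common way of measuring periodicity: the largest gap between two consecutive transactions containing the pattern, also counting the gap before the first occurrence and after the last one.

```python
# Hypothetical weekly purchases of one customer (transaction i = week i).
transactions = [
    {"salsa", "tortillas"},
    {"salsa", "tortillas", "milk", "bread"},
    {"salsa", "tortillas"},
    {"salsa", "tortillas", "milk", "bread"},
    {"beef"},
    {"milk", "bread"},
    {"beef"},
    {"milk", "bread"},
]


def occurrences(pattern, db):
    """1-based positions of the transactions that contain the pattern."""
    return [i for i, t in enumerate(db, start=1) if pattern <= t]


def support(pattern, db):
    """How often the pattern appears (frequent pattern mining)."""
    return len(occurrences(pattern, db))


def max_period(pattern, db):
    """Largest gap between consecutive occurrences, including the gap from the
    start of the database and the gap to its end (assumed definition)."""
    positions = [0] + occurrences(pattern, db) + [len(db)]
    return max(b - a for a, b in zip(positions, positions[1:]))


for pattern in [{"milk", "bread"}, {"salsa", "tortillas"}]:
    print(sorted(pattern), support(pattern, transactions), max_period(pattern, transactions))
```

      Both patterns appear in four of the eight weeks, so they are equally frequent, but only milk with bread comes back every two weeks, while salsa with tortillas stops appearing after week 4.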

      For a good book about pattern mining, you may read “Frequent Pattern Mining” by Charu Aggarwal, which gives a good overview and is quite recent.

      You may also read chapter 4 of “Introduction to Data Mining”, which is quite easy to understand, but is a little bit old.

  18. Thuy Duong says:

    Dear Philippe,
    I am studying sequential pattern mining from sequence data for my master's thesis. I really need the SNAKE dataset, but it is not publicly available to me. Could you send me the SNAKE dataset? I would be very grateful for your support.

  19. Cindy says:

    Hello sir

    Can you help me find relevant information on ethical theories regarding the issue
    “Should adultery be criminalized?”
    It is for my research assignment.

  20. Zehong Zheng says:

    Dear Professor Philippe Fournier-Viger:
    Thank you for reading!
    Recently, I ran into some problems when doing sequential pattern mining in environmental protection. I do not know how to transform a raw database into a sequence database. Can you give me some ideas based on your experience? Thank you!

  21. Saad Khan says:

    Sir, I need the Java source code of the KHMC algorithm for my project. Please help me out with this.

  22. Saad Khan says:

    Hello Sir, I need the Java source code of the KHMC algorithm for my project. It was developed to mine top-k high utility itemsets.

  23. Saad Khan says:

    Hello Sir, I need the Java source code of the KHMC algorithm for my project. Can you please help me out with this?

  24. Sirisha A says:

    Respected Sir,
    I am a research student and I refer to SPMF for datasets and algorithms. Recently I studied your paper titled “Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams”.

    In this paper you have mentioned that each book of an author will be converted into a sequence database of POS tags by performing pre-processing using the
    Rita NLP library (http://www.rednoise.org/rita/) and the Stanford NLP Tagger (http://nlp.stanford.edu/software/).

    In the Stanford NLP Tagger there are only 36 POS tags, but I have found 38 POS tags in the SPMF dataset at the following link, and these tags also differ from those of the Stanford NLP Tagger :

    The SPMF dataset has single letters as POS tags, which are not found in the Stanford NLP Tagger. I have enclosed below the POS tags found in your dataset and in the Stanford NLP Tagger. Kindly tell me the difference between them.
    POS Tags found in the SPMF dataset are :
    POS tags found in Stanford NLP Tagger
    1. CC Coordinating conjunction
    2. CD Cardinal number
    3. DT Determiner
    4. EX Existential there
    5. FW Foreign word
    6. IN Preposition or subordinating conjunction
    7. JJ Adjective
    8. JJR Adjective, comparative
    9. JJS Adjective, superlative
    10. LS List item marker
    11. MD Modal
    12. NN Noun, singular or mass
    13. NNS Noun, plural
    14. NNP Proper noun, singular
    15. NNPS Proper noun, plural
    16. PDT Predeterminer
    17. POS Possessive ending
    18. PRP Personal pronoun
    19. PRP$ Possessive pronoun
    20. RB Adverb
    21. RBR Adverb, comparative
    22. RBS Adverb, superlative
    23. RP Particle
    24. SYM Symbol
    25. TO to
    26. UH Interjection
    27. VB Verb, base form
    28. VBD Verb, past tense
    29. VBG Verb, gerund or present participle
    30. VBN Verb, past participle
    31. VBP Verb, non-3rd person singular present
    32. VBZ Verb, 3rd person singular present
    33. WDT Wh-determiner
    34. WP Wh-pronoun
    35. WP$ Possessive wh-pronoun
    36. WRB Wh-adverb

    After POS tagging, each book has to be represented as a set of sequences of POS tags, but in the dataset at the above link the entire book is represented as a single sequence. I found -2 only once, at the end of the dataset (as per the SPMF format, -2 represents the end of a sequence), and was unable to identify different sentences as different sequences.

    In order to find the top-k sequential patterns (POS tag patterns) in a book, I need to represent the book as a sequence database, but according to the link the entire book is a single sequence. Is the TKS algorithm applicable in this case?

    So, kindly help me to understand how you applied the TKS algorithm to a book to find the top-k POS tag patterns. Please clarify this for me, sir.

    If possible, kindly share the code of the pre-processing phase used for the above paper. You also mentioned that you modified the TKS algorithm for this paper. If so, kindly share that code as well so that I can understand it thoroughly.

    Thank you.

    • Good afternoon,

      Thanks for reading the blog, the papers, and using the software. And I am sorry to answer a little bit late. It is currently the summer vacation and I did not check the blog for a few days.

      I see what you mean. I will check the files, look for the code, and send an e-mail to the Gmail address that you used here.

      Best regards,

  25. Yann Eric says:

    Hello Philippe, thank you for the content of your blog and also of your YouTube channel.
    It is a great pleasure to read and watch them.

    I am currently rewriting an algorithm myself to obtain association rules,
    and I am stuck on a choice.

    I have to choose between giving the algorithm all the orders and keeping only the orders where the customer bought more than 2 products.
    With the second choice I lose information about the support of the most important products, but I am still hesitating.
    Do you have any suggestions for me?

    I have no resource problems because I write in SQL on Google's cloud (BigQuery).

    • Hello Yann,

      I apologize for the delay in answering. I am currently on vacation and nevertheless a bit overwhelmed by work, so I have not looked at the blog comments for some time. Thank you for following the blog and the YouTube channel. It is much appreciated!

      Now, to answer the question, I am not sure that one choice is better than the other, but we can discuss the results.

      Approach 1:
      – The support will be calculated exactly.
      – Transactions with one or two items are perhaps not very interesting. It depends on the application. If these are purchases in a store, that may seem reasonable, but it would still be worth looking at the results this gives from a practical point of view and analyzing them to check.

      Approach 2:
      – Removing the transactions with 1 or 2 items should speed up the algorithm. However, as you said, that may not be a priority for you. So why would you want to eliminate these transactions? Perhaps I did not understand well. I think that the main advantage would be improved performance.
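
      To see how the two approaches change the measures, here is a small Python sketch on made-up orders. The data and the function names are hypothetical, and it runs in memory rather than in SQL. It computes the support and confidence of a rule once over all orders and once over only the orders with more than 2 products.

```python
from fractions import Fraction

# Hypothetical orders; each order is the set of products bought.
orders = [
    {"bread"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk"},
    {"bread", "butter", "milk", "jam"},
    {"bread", "milk"},
]


def support(itemset, db):
    """Fraction of the orders that contain every product of the itemset."""
    if not db:
        return Fraction(0)
    return Fraction(sum(1 for order in db if itemset <= order), len(db))


def confidence(antecedent, consequent, db):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, db)
    return support(antecedent | consequent, db) / base if base else Fraction(0)


# Approach 2: keep only the orders with more than 2 products.
large_orders = [order for order in orders if len(order) > 2]

for name, db in [("all orders", orders), ("large orders", large_orders)]:
    print(name, support({"bread"}, db), confidence({"bread"}, {"butter"}, db))
```

      On this toy data the rule bread → butter looks much stronger after filtering, because the small orders that contain bread without butter are no longer counted.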

      Now, I would like to speak a bit more generally. In general, for an application such as analyzing what consumers buy in a store, it may not be necessary to have association rules containing many items. And perhaps rules with only one item in the consequent would be sufficient for what you are doing. It actually depends a bit on your goal, that is-

  26. Jean-Francois Bernier says:

    Hello! Thanks for your numerous contributions to this field of data science! After reading many papers on the topic, I still can’t wrap my head around the difference between sequence mining and episode mining. Those two approaches are strikingly similar from a bird's-eye view: they seem to ingest roughly the same kind of sequences as input and have very similar parameters and functionalities (like windowing), and yet the interpretation of the results seems to differ slightly. I was wondering if you could clarify the differences between those two types of problems, and in which context one would be preferable over the other. Thank you!

    • Good afternoon,

      Thanks for your comment. Yes, those two topics are indeed very similar. The main difference is that in episode mining the input is a single sequence that is very long, while in sequential pattern mining, the input is multiple sequences (a sequence database).

      Then, in episode mining, the goal is to find subsequences that appear many times in a long sequence.
      And, in sequential pattern mining, the goal is to find subsequences that are common to many sequences.

      For example, let’s say that we have a very long sequence of locations visited by a person in a city. We could apply episode mining to find some sequences of locations often visited by that person.

      Now let’s say that you have the sequences of locations visited by hundreds of people in a city. Then you could apply sequential pattern mining to find some sequences of locations that are common to many people.
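
      The two ways of counting can be sketched in Python on made-up location data. This is a simplified illustration with hypothetical function names: for the episode, I assume the frequency is the number of non-overlapping occurrences in the long sequence, which is only one of several measures used in episode mining, and I ignore time windows.

```python
def contains(sequence, pattern):
    """True if the pattern appears in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def sequence_support(pattern, database):
    """Sequential pattern mining: number of sequences that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


def non_overlapping_occurrences(pattern, long_sequence):
    """Episode mining (assumed measure): number of non-overlapping occurrences
    of the pattern in one long sequence, found by a greedy left-to-right scan."""
    if not pattern:
        return 0
    count, k = 0, 0
    for item in long_sequence:
        if item == pattern[k]:
            k += 1
            if k == len(pattern):
                count, k = count + 1, 0
    return count


# One person, one long sequence of visited locations (hypothetical).
one_person = ["home", "cafe", "office", "gym", "home", "cafe",
              "park", "office", "home", "office"]

# Many people, one sequence each (hypothetical).
many_people = [
    ["home", "cafe", "office"],
    ["home", "office", "gym"],
    ["home", "park", "cafe", "office"],
    ["gym", "home"],
]

pattern = ("home", "cafe", "office")
print(non_overlapping_occurrences(pattern, one_person))
print(sequence_support(pattern, many_people))
```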

      So that is the main difference between those two tasks! Hope it is clearer. Thanks for reading the blog!

  27. K. P. Birla says:

    Thank you for providing wonderful resources to study HUPM on spmf.
    Please provide the updated map-figure of https://www.philippe-fournier-viger.com/spmf/map_algorithms_spmf_data_mining097.png

    • Thanks for your comment! I am currently working on the new version of SPMF and making several improvements that will be released soon!

      I appreciate your suggestion and will try to update the map soon as well. Best regards,
