About the author

Philippe Fournier-Viger, Ph.D. is a data mining researcher and professor. He uses this blog as a personal website to discuss his thoughts, opinions and ideas about data mining and research in general.

He is the founder and main author of the SPMF data mining software, an open-source package offering more than 140 algorithms for discovering itemsets, association rules, sequential patterns and rules in sequences and transactions. The SPMF software has been cited in more than 450 papers and its website has received more than 520,000 visitors since 2010. He has also written or participated in more than 170 research papers, which have received more than 1,850 citations. He is one of the two editors-in-chief of the Data Science and Pattern Recognition journal.


Website:  http://www.philippe-fournier-viger.com
LinkedIn: http://www.linkedin.com/profile/view?id=63632045&trk=tab_pro
Twitter: https://twitter.com/philfv
ResearchGate: https://www.researchgate.net/profile/Philippe_Fournier-Viger
Google Scholar profile: https://scholar.google.com/citations?user=QG_7KjoAAAAJ&hl=en&oi=sra
DBLP profile: http://dblp.uni-trier.de/pers/hd/f/Fournier=Viger:Philippe

54 Responses to About the author

  1. Aasha.M says:

    please help me in finding a good research topic in using data mining in healthcare

  2. aakansha Saxena says:

    Please give me some research guidance. I want to work on sequential frequent pattern generation. Could you please suggest what can be done?

    • There are many possible topics:
      – make a faster or more memory-efficient algorithm
      – make an algorithm for a variation of the problem, such as mining sequential rules, mining uncertain seq. pat., mining seq. pat. in a stream, mining high utility seq. pat., mining weighted seq. pat., mining fuzzy seq. pat., incremental mining of seq. pat., incremental mining of uncertain seq. pat., etc. You can also combine several of these ideas to define a new algorithm.

      I think that the key is to read the recent papers and to see what has been done and what has not been done. Then you will get some ideas about what you can do.

      Personally, I have a few good ideas related to these topics, but it takes time to find a good idea, so when I find one, I keep it for myself.

  3. eli says:

    Please guide me about data mining in blood transfusion. I want to work on blood transfusion data (the data includes donor tests such as hcc, hbs and the like).
    Please guide me.

  4. nasrin says:

    I would like to know about a data mining technique/algorithm that would be better for predicting
    a certain outcome, e.g. selecting an appropriate technique for software requirement elicitation. I need it for my master's thesis. Please tell me which algorithm is easy to use and also effective, and what language to use for the problem.

  5. ahmad says:

    Hello Sir

    I am doing my master's in computer engineering and I am interested in data mining for my thesis. I was thinking about big data, but I assume it will take a lot of time to finish or do something in this field, and it is only a master's, not a PhD, and I have 6 months to do something and publish a paper. So i am thinking of.. So please suggest a specific topic on which I can do my research and some useful references. It will be really nice of you, thanks!

    • I recommend talking with your supervisor about choosing a topic, and choosing a topic that is not too complicated. Six months is quite short.

      Perhaps you should try to find some datasets and then do some more experiments with them. Or choose an algorithm and try to improve it. I don’t have many other ideas. Actually, choosing a topic can take a lot of time. Therefore, I cannot do the search for you.

  6. Leonid Aleman Gonzales says:

    Please, can you recommend any good resource for exploring data mining techniques that evaluates their applications in interesting cases?
    Could I join a research group that you lead, and how can I do so?

  7. T Ramesh says:

    Hello sir,
    My name is T Ramesh from India, working as an Asst. Professor in one of the engineering colleges. Recently I registered for a Ph.D in data mining. Currently I am looking for issues in data mining. I thought of working on “Privacy in social networking sites”. Would you please tell me whether it has scope in the future? Or would you please suggest any other issues to work on? I would be very thankful to you. Also, one of my friends registered for a Ph.D and he has graph mining as his area of research. Would you please suggest recent issues in graph mining to work on?

    • Hi,

      Graph mining and privacy in social networks are good and popular research areas, so I think that choosing them is a good idea. But you will still need to define your topic more precisely. I cannot help you much with that since, as I said in the blog post, it requires doing a literature review, which takes time, and I only have time to do that for my students. Just look at recent papers in good conferences and you should find out what the current challenges are.


  8. Manjunatha H C says:

    Hi sir,

    This is Manjunath from Karnataka, india. I have registered for Ph.D at VIT University, Vellore. I’m interested to work on social networks using graph mining. can you please suggest recent research topics on these. i would be very thankful to you if you help me in this regard.


    • To know the recent topics, you need to read papers from the recent conferences and journals on data mining / social network mining. For example, you could have a look at papers published at ASONAM 2014.

  9. Kareem says:

    Hi Philippe,

    May i just say that your article on how to choose a good topic was excellent and useful.
    I was wondering if you could email me as i wish to ask you a private question.

    Kind regards,

  10. Waqas says:

    hi Sir,
    Can you please tell me whether a dataset of about 30k is enough for analysis in a master's-level thesis?

    • Even a dataset of 100 or 1k can be enough. It all depends on what you are doing with the data. If you want to test the performance of an algorithm, you need more data. But if you want to do something more applied, such as predicting which students will drop out of school, having data about just a few hundred students may be enough to show that your approach works.

  11. Umar says:

    Hello Dr. I’m preparing to embark on a PhD, and I have always thought about applying data mining to telecommunication data (specifically customer calls).
    I’m thinking about rating/identifying the quality of service delivery of different service providers from their past customer call data, or seeing whether we can identify a pattern that distinguishes one service provider from another based on their customer calls.
    Could that be sufficient for PhD research work?
    Thank you

    • Hi, this topic can be good. If you can get some real data, it will be more interesting. In a Ph.D., a student will generally work on a very narrow topic but will become an expert on this narrow topic and investigate it very deeply. So, I don’t see any problem with choosing this. But obviously, you should also do a literature review to see what other people have done with call/customer data, and to see more specifically what can be done that is original. Good luck!

  12. Ramani says:

    I am trying to find a PhD research problem related to discovering patterns (retrieving information) from large-scale data (big data) for text mining. Could you please suggest what can be done in this area? I am totally confused about whether to improve an algorithm or just apply the algorithm to the dataset and show the results… Please guide me.

    • I think both are possible. It is possible to make a better algorithm (faster, more accurate, etc.). Or it is possible to show that the algorithm gives good results for some particular application of text mining. If you design a better algorithm, you probably need to be better at programming, and then your contribution is about algorithm design. If you choose the second option, then your contribution is for a specific application domain, and you need to better explain how it is useful for your application, and probably also compare with alternative methods that do not use patterns to see how your approach improves the results. Or you could do both. These are just some general ideas. You actually need to do a literature review to find a good topic.

  13. Yogalakshmi J says:

    Dear Sir,
    I am Yogalakshmi and I am working on a project related to sequence mining. I was exploring the SPMF framework. I have some doubts with respect to the representation of datasets for sequence mining in general. Is it possible that in all my sequences, I only have a single itemset with many items in it? For example:
    1 {A1, A2,A3,A4,A5,A6,A7,A8,A9,A10}
    2 {A4,A5,A11,A12}
    3 {A13,A8,A9,A10,A16,A17,A18,A19,A20}

    • Yes, it is possible to have a single itemset per sequence.

      However, in a sequence database, items within an itemset are assumed to be simultaneous, and an item may not appear twice within the same itemset.

      Besides, if you have a single itemset per sequence, there is no notion of before or after anymore. Thus, your sequence database is actually a transaction database, and it may be better in that case to use frequent itemset mining or association rule mining algorithms rather than sequential pattern mining algorithms.
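
      The two conditions above can be checked mechanically. The following is a minimal sketch, assuming each sequence is stored as a list of itemsets (lists of item names); the function names are hypothetical and are not part of SPMF.

```python
def validate_sequence_database(sequences):
    """Raise ValueError if an item appears twice within the same itemset."""
    for sid, sequence in enumerate(sequences, start=1):
        for itemset in sequence:
            if len(set(itemset)) != len(itemset):
                raise ValueError(f"sequence {sid}: duplicate item in {itemset}")


def is_transaction_database(sequences):
    """Return True when every sequence holds exactly one itemset,
    i.e. when there is no notion of before/after left in the data."""
    return all(len(sequence) == 1 for sequence in sequences)


# The three sequences from the question, each made of a single itemset.
sequences = [
    [["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"]],
    [["A4", "A5", "A11", "A12"]],
    [["A13", "A8", "A9", "A10", "A16", "A17", "A18", "A19", "A20"]],
]

validate_sequence_database(sequences)  # no duplicate items, so no error
if is_transaction_database(sequences):
    print("Single itemset per sequence: treat the data as transactions.")
```

      When the check succeeds, the data can be fed to an itemset or association rule mining algorithm instead of a sequential pattern mining algorithm.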


    hai sir

    How do I select a research problem in utility mining in data mining? Please give me some suggestions, sir, for example about utility mining with negative item values.
    How do I select the problem? Please give some examples and tell me what datasets are used.

    I have one doubt, sir, in utility mining: who obtains the utility, the user or the vendor?
    Please tell me some applications of utility mining and how they use utility mining.

    • Hello,

      The traditional example of high-utility itemset mining is related to a retail store. We assume that the utility is the profit that the retail store obtains by selling products. The input is a transaction database, where you have many transactions made by customers. Each transaction indicates the items that a customer has bought, with the corresponding purchase quantities. For example, a transaction can indicate that a customer has bought three units of beer and two units of bread. Besides that, each item has a unit profit (the amount of money that the retail store earns for each unit sold). For example, each beer may generate a $1 profit, while each bread may generate a $0.40 profit. Globally, given a database with many customer transactions, the goal is to find the groups of items that generate a high profit for the retail store.

      This is the standard definition of the problem of high-utility itemset mining. However, it is easy to apply this definition in many other domains. For example, you could apply high-utility itemset mining to analyse the webpages viewed by users. In that case, the utility could perhaps be defined in terms of the time spent on each page and the importance of the webpage. There can also be many other applications. A few of them are listed in the literature.
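
      The retail example above can be sketched in a few lines. This is only an illustration under stated assumptions: the transactions, the item "milk" and its profit, and the function names are made up; profits are stored in cents to avoid floating-point rounding; and the enumeration is brute force, whereas real algorithms such as those in SPMF prune the search space.

```python
from itertools import combinations

# Unit profit in cents: beer = $1, bread = $0.40 (from the example above);
# milk is an extra, made-up item.
unit_profit = {"beer": 100, "bread": 40, "milk": 50}

# Each transaction maps an item to the purchased quantity (made-up data).
transactions = [
    {"beer": 3, "bread": 2},
    {"bread": 1, "milk": 2},
    {"beer": 1, "bread": 4, "milk": 1},
]


def itemset_utility(itemset, transactions, unit_profit):
    """Total profit generated by `itemset` in the transactions containing all its items."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Brute-force search for all itemsets whose utility is at least min_utility."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= min_utility:
                result[itemset] = utility
    return result


print(high_utility_itemsets(transactions, unit_profit, min_utility=400))
```

      With a minimum utility of 400 cents, only {beer} (400) and {beer, bread} (640) qualify in this toy database.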

      You can find many datasets on the SPMF library webpage: http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
      Besides, on the SPMF webpage, you can also find implementations of many of the most popular utility mining algorithms.

      Now, finding a topic requires reading some papers to know what other people are working on, so before choosing a specific topic, you need to read at least a few papers. You should also ask your supervisor to help you find a topic. There are a lot of possibilities. For example, no paper has yet addressed the following topics:
      – discovering rare high-utility sequential rules
      – discovering uncertain high-utility sequential rules
      – discovering uncertain sequential rules (not related to high-utility)
      – etc.
      These are just a few examples that I can think of. Actually, you can generate a new topic by combining any two existing topics, such as:
      rare itemset mining + high-utility sequential rule mining = rare high-utility sequential rule mining.


    hai sir
    What is on-shelf utility mining? Please give me some examples and applications of on-shelf utility mining.

    • On-shelf high-utility itemset mining is similar to high-utility itemset mining, but we consider the shelf time of products (when the products are on sale and when they are not). For example, imagine that you have a retail store in a country where the weather is hot during the summer and cold during the winter. During the summer, the retail store may sell swimming suits and other products related to the beach. But during the winter, the retail store may not sell swimming suits. It may instead sell products related to winter sports. Besides, there are also some other products, such as milk, that are sold all year round.

      In on-shelf high-utility itemset mining, we want to deal with the information that some products are sold only during some part of the year. The idea is that it is not fair to evaluate all the itemsets (or sets of products) by calculating the profit over a year. The reason is that some products are sold only during one month or two months, while other products are sold during the whole year. Thus, these products don’t have the same opportunity to generate a high profit.

      The difference between high-utility itemset mining and on-shelf high-utility itemset mining is that the latter considers the time periods where items are sold to provide a fairer measurement of the profit of itemsets. Actually, it is a very interesting problem, and there are several possibilities to extend it.
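
      One way to make this "fairer measurement" concrete is to compare the profit of an itemset only with the total profit of the periods in which it was on the shelf. The sketch below is a simplified illustration, not the exact measure from the paper: it assumes an item is on the shelf in every period where it was sold at least once, and all data, prices and function names are hypothetical.

```python
from fractions import Fraction

unit_profit = {"swimsuit": 10, "skis": 50, "milk": 1}

# (period, {item: quantity}) pairs; made-up data for the summer/winter example.
transactions = [
    ("summer", {"swimsuit": 2, "milk": 1}),
    ("summer", {"swimsuit": 1}),
    ("winter", {"skis": 1, "milk": 2}),
    ("winter", {"milk": 3}),
]


def transaction_utility(t):
    """Profit generated by all the items of one transaction."""
    return sum(qty * unit_profit[item] for item, qty in t.items())


def shelf_periods(itemset):
    """Periods in which every item of the itemset was sold at least once
    (simplifying assumption used here to decide when items are on the shelf)."""
    periods = None
    for item in itemset:
        sold = {p for p, t in transactions if item in t}
        periods = sold if periods is None else periods & sold
    return periods or set()


def relative_utility(itemset):
    """Profit of the itemset divided by the total profit of its shelf periods."""
    periods = shelf_periods(itemset)
    total = sum(transaction_utility(t) for p, t in transactions if p in periods)
    if total == 0:
        return Fraction(0)
    gained = sum(
        sum(t[item] * unit_profit[item] for item in itemset)
        for p, t in transactions
        if p in periods and all(item in t for item in itemset)
    )
    return Fraction(gained, total)


print(relative_utility(("swimsuit",)))  # judged against summer sales only
print(relative_utility(("milk",)))      # judged against the whole year
```

      In this toy data, swimsuits account for 30/31 of the summer profit, whereas their share of the whole year's profit would only be 30/86, which illustrates why the shelf time matters.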

      You can read my paper about on-shelf high-utility-itemset mining here:
      It will give you all the details.


    hai sir
    I am D. Srinivasa Rao, a research scholar. Please guide me or give me suggestions for finding a problem in utility mining. Shall I proceed with high-utility itemset mining with negative item values, or shall I proceed with on-shelf utility mining with negative item values? Please tell me whether it is better to proceed in that direction in utility mining. If it is a good direction, please give me some suggestions on how to select the problem.

    • Hi,
      I gave you a few suggestions when I replied to your other comment.

      On-shelf high-utility itemset mining with negative item values is an OK topic. But some algorithms already exist for this. So in that case, you can either extend them further to do more things, or try to make a faster algorithm.

  17. Revathi says:

    Hi Sir,
    I am an M.Tech student. I am doing a project in the field of data mining. I am unable to find the differences among periodic patterns, popular patterns and frequent patterns. When I read the examples of each kind of pattern, they seem to be different, but I am not able to differentiate them when comparing. Please give some examples of each kind of pattern and where we can apply them in the real world. Please also suggest some books related to pattern mining.

    • Hi,
      A periodic pattern is something that appears periodically. For example, if you analyse customer transactions in a supermarket, you may find that some people buy milk and bread with cheese every week. It is said to be periodic because it appears regularly after some period of time (e.g. every week). That is the basic idea of periodic patterns. One real-world application is to analyze customer transactions. But it could also be applied to many other types of data, such as the electricity consumption of users.

      A frequent pattern is something that appears frequently in a database. For example, if you analyze customer transactions again, you may find that people have frequently bought some items together, such as salsa sauce with tortillas and beef. Frequent pattern mining has a lot of applications besides market basket analysis. There are hundreds of applications and it would take too long to list all of them. Besides, there are many variations such as finding frequent sequences, frequent sub-graphs, etc.

      By the way, you can also combine different kinds of constraints to find patterns. For example, it would be possible to find frequent periodic patterns.
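
      The difference between "frequent" and "periodic" can be illustrated with two simple measures: the support (how many transactions contain the pattern) and the largest gap between consecutive appearances. This is a minimal sketch on made-up, time-ordered transactions; the function names are hypothetical, and counting the gaps before the first and after the last appearance is one common convention, assumed here.

```python
# Time-ordered transactions (e.g. one per week); made-up data.
transactions = [
    {"milk", "bread", "salsa", "tortillas"},
    {"salsa", "tortillas"},
    {"milk", "bread", "salsa", "tortillas"},
    {"salsa", "tortillas"},
    {"milk", "bread"},
    {"eggs"},
    {"milk", "bread"},
    {"eggs"},
]


def occurrences(pattern, transactions):
    """1-based positions of the transactions that contain the whole pattern."""
    return [i for i, t in enumerate(transactions, start=1) if pattern <= t]


def support(pattern, transactions):
    """Number of transactions containing the pattern (its frequency)."""
    return len(occurrences(pattern, transactions))


def max_period(pattern, transactions):
    """Largest gap between consecutive appearances, including the gap before
    the first appearance and the gap after the last one."""
    positions = [0] + occurrences(pattern, transactions) + [len(transactions)]
    return max(b - a for a, b in zip(positions, positions[1:]))


for pattern in ({"milk", "bread"}, {"salsa", "tortillas"}):
    print(sorted(pattern), support(pattern, transactions), max_period(pattern, transactions))
```

      Both patterns have the same support (4), so they are equally frequent, but {milk, bread} reappears at least every 2 transactions, while {salsa, tortillas} disappears for 4 transactions in a row. Only the first one would be kept as periodic with a maximum period of 2.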

      For a good book about pattern mining, you may read “Frequent Pattern Mining” by Charu Aggarwal, which gives a good overview and is quite recent.

      You may also read chapter 4 of “Introduction to Data Mining”, which is quite easy to understand, although it is a little bit old.

  18. Thuy Duong says:

    Dear Phillipe,
    I am studying sequential pattern mining from sequence data for my master's thesis. I really need the SNAKE dataset, but it is not publicly available. Could you send me the SNAKE dataset? I would be very grateful for your support.

  19. Jisna says:

    Hi sir,
    I am an M.Tech student learning about privacy-preserving utility mining for my thesis. For that, I need to extract high-utility patterns, given a transaction database and an external utility table. But I need help with the implementation of the algorithm (UP-Growth or UP-Growth+). Could you please help?

    • Hello,

      Why do you want to implement UPGrowth? UPGrowth is an old algorithm and much faster algorithms have been proposed. For example, HUI-Miner, FHM and EFIM can be more than 1000 times faster than UPGrowth. Besides, algorithms like HUI-Miner and FHM are easier to implement than UPGrowth.

      If you want to see the source code of these algorithms, you can check the SPMF open-source data mining library which provides the source code in Java.

  20. roopa says:

    Hi, I am trying to use your FP-Growth code for my transactions, but it is not working. Can you please help me?

  21. Mukesh Patel says:

    I was searching for a data mining journal and I found “Data Science and Pattern Recognition”.
    Sir, I have published a paper in a Springer conference, which is published in the LNCS series.
    Now, I want to publish the extended version of the paper. Can I publish the paper in your journal?

    Reply as soon as possible..

    • Hello, Yes, you are welcome to submit your extended version. I think that if your paper was published in LCNS it should be quite good. For the extended version, try to add more content such as more references, more related work, more experiments and, in general, more details. In general, an extended paper should have about 30% to 40% more content than the conference paper. It is best if you can submit the paper before June to be in the next issue.

      If you need suggestions for how to extend the paper or something else related to submitting to our journal, just let me know. You can even send the paper to my e-mail before submission so that I can give you some advice and suggestions, if you need some.

      Currently, there are not a lot of submissions yet, so I think that you have a high probability of being accepted in the next issue 😉

      • roger that says:

        >> I think that if your paper was published in LCNS(sic) it should be quite good.

        Thanks for your posts. These are informative.
        I wanted to know how reputable publishing in LNCS is. For instance, take the case of the ‘Graphic Recognition’ series in LNCS. Are there any references about this, if available?

        • Hello, thanks for reading the blog. LNCS and LNAI are series of conference proceedings books published by Springer. Some very good conferences are published in books of the LNCS and LNAI series. But there are also some conferences that are good or just average, which are also published in LNCS and LNAI. In general, you will never find a very weak conference in those series. LNCS and LNAI ensure some minimum level of quality, although the level is variable.

          I have published many papers in LNCS and LNAI because those book series are automatically indexed by several publication databases, which gives visibility to the papers. For example, the LNCS/LNAI series are indexed by EI, DBLP, SCOPUS, and several other indexes. Personally, I only publish in Springer, IEEE or ACM for conference papers.

          IEEE is good, but one should be careful to distinguish the official IEEE conferences from those that are sponsored by IEEE but are not official IEEE conferences. The latter can sometimes be very weak. With Springer, there are some weaker conferences too, but I think that the level of the worst Springer conferences is better than that of the worst IEEE-sponsored conferences. Just my opinion.

          After that, you should not just look at the publisher but look at how famous a given conference is. This is actually more important than choosing the publisher.

  22. nani says:

    Hello Sir,

    I would like to analyse the Compact Prediction Tree (+) in depth and I have also read your papers about it. Can you recommend any further literature?
    I also didn’t understand some points regarding the training sequences. So I would like to know where I get them for a specific prediction task.

    For example the webpage prediction task. If a user visits webpage A,B,C in that order and now we want to predict which webpage is going to be visited next from that user in order to prefetch the data, on which “training sequences” are we operating so we can make conclusions for the next sequence/item/webpage ?
    I would really appreciate any help.

    Kind regards

    • Hello,
      I think that there are many things that can be improved with regard to CPT+, in particular by extending it to other problems. So it is a good topic for research.

      About CPT+, there are mainly two papers:

      Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI9078, pp. 625-636.

      Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

      Besides, you can get a 63-page PowerPoint presentation explaining CPT here:


      Besides that, you can get the source code in Java of CPT, CPT+ and the other algorithms used in these papers, as well as the datasets used in these papers, in the SPMF library:


      So, if you are not sure how it works, you can also check the code.

      2) For prediction, you need to train the model. If you have data from many users, you can use it to train a model. Then you can use the model to make predictions either for the same users or for some new users.

      Another way is that if you have a lot of data about a single user, you could train a model specifically for that user and use that model to make predictions only for that user. Or if you have multiple users, you could create multiple CPT(+) models (one for each user).

      But I think that in real life you don’t necessarily have a lot of data for each user. So sometimes, you may use the data about all users to train your model. It really depends on what you are doing. For example, if you use CPT+ to predict the next words that someone will type on his cellphone, then you certainly have enough data to train a model for the user, because the user may always be typing on his cellphone, and each sentence could be a sequence.

      So CPT(+) can be used in any of these ways, because in CPT(+) there is no concept of users. There is only the concept of training sequences and testing sequences (for prediction). The training sequences can be either from the same user or different users. It depends what kind of data you have.
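
      The two ways of choosing training sequences described above can be sketched as follows. This is only an illustration: the model is a simple successor-count predictor standing in for CPT(+), not the CPT algorithm itself, and the user names, browsing sessions and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Count, for each item, which items directly follow it in the training sequences.
    (A stand-in model for illustration only; this is NOT CPT or CPT+.)"""
    successors = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            successors[current][nxt] += 1
    return successors


def predict_next(model, prefix):
    """Predict the most frequent successor of the last item of the prefix."""
    if not prefix or prefix[-1] not in model:
        return None
    return model[prefix[-1]].most_common(1)[0][0]


# Hypothetical browsing sessions: each session is one training sequence.
sessions_by_user = {
    "user1": [["A", "B", "C", "D"], ["A", "B", "C", "D"]],
    "user2": [["B", "C", "E"], ["C", "E"], ["C", "E"]],
}

# Option 1: one model trained on the sequences of all users.
pooled = train(seq for sessions in sessions_by_user.values() for seq in sessions)

# Option 2: one model per user, trained only on that user's sequences.
per_user = {user: train(sessions) for user, sessions in sessions_by_user.items()}

visited = ["A", "B", "C"]
print(predict_next(pooled, visited))             # uses everybody's data
print(predict_next(per_user["user1"], visited))  # uses only user1's data
```

      Here the pooled model predicts "E" after A, B, C because user2's sessions dominate, while the model trained only on user1's sessions predicts "D". The model itself has no concept of users; only the choice of training sequences differs.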

      Best regards,

  23. Ameer Abdullah says:

    Sir, I have used your software (SPMF) and it is fantastic. Many thanks for this. I just need to ask a question: I am using Apriori to generate rules on my dataset. Now I want to check (test) these rules on testing data. Is there any support in your software to do so? If it exists, then please tell me. It will be very nice of you.

    • Hello, thanks for using SPMF. The algorithms for discovering rules will find exactly the rules that you request. For example, if you set minconf = 50 %, then the algorithm will find all the rules having a confidence greater than or equal to 50 %. But, yes, after that you may use the rules in many different ways to “check (test)” them. How to check whether a rule is really good depends on what you want to do with it. For example, it could be used for recommendation or other things. Thus, in SPMF, there is no such function for now. But if someone wants to implement, for example, classification using association rules, then it could be added to SPMF. Or if someone wants to implement something like clustering using association rules, then it could also be added to SPMF. Best regards,
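
      As a minimal sketch of one possible way to check a rule on testing data, the confidence of the rule can simply be recomputed on held-out transactions and compared with minconf. This is not an SPMF function; the function name, the rule and the data below are hypothetical.

```python
def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent: among the transactions
    containing the antecedent, the fraction that also contain the consequent.
    Returns None if the antecedent never occurs."""
    antecedent, consequent = set(antecedent), set(consequent)
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return None
    return sum(1 for t in covered if consequent <= t) / len(covered)


# Made-up training and testing transactions.
training = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread"},
    {"eggs"},
]
testing = [
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk"},
]

minconf = 0.5
rule = ({"bread"}, {"milk"})  # bread -> milk

train_conf = confidence(*rule, training)
test_conf = confidence(*rule, testing)
print(train_conf, test_conf, test_conf is not None and test_conf >= minconf)
```

      In this toy example, the rule has a confidence of 0.75 on the training data and 0.5 on the testing data, so it still meets minconf = 50 % on unseen transactions. Whether this kind of check is the right one depends on what the rules are used for.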

Leave a Reply

Your email address will not be published. Required fields are marked *