About the author

Philippe Fournier-Viger

Welcome to my blog!

I am Philippe Fournier-Viger, Ph.D., a researcher and professor working on data mining, artificial intelligence and their applications. I use this blog as a personal website to discuss my thoughts, opinions and ideas about data mining and research in general.

I am the founder and main author of the SPMF data mining software, an open-source library offering more than 250 algorithms, specialized in pattern mining. It offers algorithms for many tasks, such as discovering frequent itemsets, high utility itemsets, episodes, association rules, sequential patterns and sequential rules in sequences, graphs, and transactions.
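
For example, SPMF is typically run from the command line. The following call (the file names here are placeholders) mines frequent sequential patterns with the PrefixSpan algorithm using a minimum support of 50%; the exact parameters of each algorithm are described in the SPMF documentation:

    java -jar spmf.jar run PrefixSpan input.txt output.txt 50%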

The SPMF software has been cited in more than 1,000 papers and has received more than 1,000,000 visitors since 2010. I have participated in more than 350 research articles, which have received more than 11,400 citations. I am editor-in-chief of the Data Science and Pattern Recognition journal, and former associate editor-in-chief of the Applied Intelligence journal (Springer, SCI Q2). I have edited or co-edited several books such as “High-Utility Pattern Mining” (Springer). I have been a keynote speaker at more than 15 international conferences such as MIWAI 2019, ICGEC 2018, CCNS 2020, IAECST 2020, ICAICE 2020 and IFID 2020.

68 Responses to About the author

  1. Aasha.M says:

    Please help me find a good research topic on using data mining in healthcare.

  2. aakansha Saxena says:

    Please give me some research guidance. I want to work on sequential frequent pattern generation. Could you please suggest what can be done?

    • There are many possible topics:
      – make a faster algorithm or a more memory-efficient algorithm
      – make an algorithm for a variation of the problem, such as mining sequential rules, mining uncertain seq. pat., mining seq. pat. in a stream, mining high utility seq. pat., mining weighted seq. patterns, mining fuzzy seq. pat., incremental mining of seq. pat., incremental mining of uncertain seq. pat., etc. I mean that you can combine several ideas and define a new algorithm. Or you can make something faster or more memory-efficient.

      I think that the key is to read the recent papers and to see what has been done and what has not been done. Then you will get some ideas about what you can do.

      As for me, I have a few good ideas related to these topics, but it takes time to find a good idea, so when I find one, I keep it for myself.

  3. eli says:

    Please guide me about data mining in blood transfusion. I want to work on blood transfusion data (the data consists of HCC, HBS and similar tests of donors).

  4. nasrin says:

    I would like to know about a data mining technique/algorithm that would be good for predicting a
    certain outcome, e.g., selecting an appropriate technique for software requirement elicitation. I need it for my master’s thesis. Please tell me which algorithm is easy to use and also effective, and what language to use for the problem.

  5. ahmad says:

    Hello Sir

    I am doing my master’s in computer engineering and I am interested in data mining for my thesis. I was thinking about big data, but I assume it would take a lot of time to finish or do something in this field, and it is only a master’s, not a Ph.D., and I have 6 months to do something and publish a paper. So please suggest a specific topic on which I can do my research, and some useful references. It would be really nice of you, thanks!

    • I recommend talking with your supervisor about choosing a topic, and choosing a topic that is not too complicated. Six months is quite short.

      Perhaps you should try to find some datasets and then do some more experiments with them. Or choose an algorithm and try to improve it. I don’t have too many ideas. Actually, choosing a topic can take a lot of time. Therefore, I cannot do the search for you.

  6. Leonid Aleman Gonzales says:

    Please, can you recommend any good resource for exploring data mining techniques that evaluates their applications in interesting cases?
    Also, could I join a research group that you supervise, and how could I do that?

  7. T Ramesh says:

    Hello sir,
    My name is T Ramesh from India, working as an Asst. Professor in one of the engineering colleges. Recently I registered for a Ph.D. in data mining, and I am currently looking for issues in data mining. I thought of working on “privacy in social networking sites”. Could you please tell me whether or not it has scope in the future? Or could you please suggest other issues to work on? I would be very thankful to you. Also, one of my friends registered for a Ph.D. and has graph mining as his area of research. Could you please suggest recent issues in graph mining to work on?

    • Hi,

      Graph mining and privacy in social networks are good and popular research areas, so I think that choosing them is a good idea. But you will still need to define your topic more precisely. I cannot help you much with that since, as I said in the blog post, it requires doing a literature review, which takes time, and I only have time to do that for my students. Just look at recent papers in good conferences and you should find out what the current challenges are.

      Best,

  8. Manjunatha H C says:

    Hi sir,

    This is Manjunath from Karnataka, India. I have registered for a Ph.D. at VIT University, Vellore. I am interested in working on social networks using graph mining. Can you please suggest recent research topics on these? I would be very thankful to you if you could help me in this regard.

    -Manju

    • To know the recent topics, you need to read papers from the recent conferences and journals on data mining / social network mining. For example, you could have a look at papers published at ASONAM 2014.

  9. Kareem says:

    Hi Philippe,

    May I just say that your article on how to choose a good topic was excellent and useful.
    I was wondering if you could email me, as I wish to ask you a private question.

    Kind regards,
    Kareem

  10. Waqas says:

    hi Sir,
    Can you please tell me whether a dataset of about 30k records is enough for analysis in a master’s-level thesis?

    • Even a dataset of 100 or 1k records can be enough. It all depends on what you are doing with the data. If you want to test the performance of an algorithm, you need more data. But if you want to do something more applied, such as predicting which students will drop out of school, having data about just a few hundred students may be enough to show that your approach works.

  11. Umar says:

    Hello Dr., I am preparing to embark on a Ph.D., and I have always thought about applying some data mining to telecommunication data (specifically customer calls).
    I am thinking about rating/identifying the quality of service delivery by different service providers from their past customer call data, or seeing whether we can identify a pattern that could distinguish one service provider from another based on their customer calls.
    Could that be sufficient for Ph.D. research work?
    Thank you

    • Hi, this topic can be good. If you can get some real data, it will be more interesting. In a Ph.D., a student will generally work on a very narrow topic, but will become an expert on this narrow topic and investigate it very deeply. So I don’t see any problem with choosing this. But obviously, you should also do a literature review to see what other people have done with call/customer data, and to see more specifically what can be done that is original. Good luck!

  12. Ramani says:

    Hi,
    I am trying to find a Ph.D. research problem related to discovering (retrieving information about) patterns from large-scale data (big data) for text mining. Could you please suggest what can be done in this area? I am totally confused about whether to improve an algorithm or just apply an algorithm to a dataset and show the results… please guide me.

    • I think both are possible. It is possible to make a better algorithm (faster, more accurate, etc.), or it is possible to show that an algorithm gives good results for some particular application of text mining. If you design a better algorithm, you probably need to be better at programming, and your contribution is then about algorithm design. If you choose the second option, your contribution is for a specific application domain, and you then need to explain how it is useful for your application, and probably also compare with alternative methods that do not use patterns to see how your approach improves the results. Or you could do both. These are just some general ideas; you actually need to do a literature review to find a good topic.

  13. Yogalakshmi J says:

    Dear Sir,
    I am Yogalakshmi and I am working on a project related to sequence mining. I was exploring the SPMF framework, and I have some doubts with respect to the representation of datasets for sequence mining in general. Is it possible that, in all my sequences, I only have a single itemset with many items in it? For example:
    1 {A1, A2,A3,A4,A5,A6,A7,A8,A9,A10}
    2 {A4,A5,A11,A12}
    3 {A13,A8,A9,A10,A16,A17,A18,A19,A20}

    • Yes, it is possible to have a single itemset per sequence.

      However, to be considered a sequence database, the items within an itemset are assumed to be simultaneous, and an item may not appear twice within the same itemset.

      Besides, if you have a single itemset per sequence, there is no notion of before or after anymore. Thus, your sequence database is actually a transaction database, and it may be better in that case to use frequent itemset mining or association rule mining algorithms rather than sequential pattern mining algorithms.
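
      For example, using the SPMF file format (where -1 marks the end of an itemset and -2 marks the end of a sequence; the numbers are illustrative item identifiers), a sequence made of the two itemsets {1, 2} followed by {3} is written:

      1 2 -1 3 -1 -2

      whereas the same items viewed as a single transaction of a transaction database are simply written on one line:

      1 2 3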

  14. DIVVELA SRINIVASA RAO says:

    Hi sir,

    How do I select a research problem in utility mining, specifically utility mining with negative item values? Please give me some suggestions on how to select the problem, with some examples, and tell me which datasets are used.

    I also have one doubt, sir: in utility mining, who achieves the utility, the user or the vendor?
    Please tell me some applications of utility mining and how they use it.

    • Hello,

      The traditional example of high-utility itemset mining is related to a retail store. We assume that the utility is the profit that the retail store obtains by selling products. The input is a transaction database, which contains many transactions made by customers. Each transaction indicates the items that a customer has bought, with the corresponding purchase quantities. For example, a transaction can indicate that a customer has bought three units of beer and two units of bread. Besides, each item has a unit profit (the amount of money that the retail store earns for each unit sold). For example, each beer may generate a 1$ profit, while each bread may generate a 0.40$ profit. Globally, given a database with many customer transactions, the goal is to find the groups of items that generate a high profit for the retail store.
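
      To make this concrete with the numbers above: the utility of the itemset {beer, bread} in that transaction is 3 × 1$ + 2 × 0.40$ = 3.80$, and the utility of an itemset in the whole database is the sum of its utilities over all the transactions where it appears. The itemset is called a high-utility itemset if this total is no less than a minimum utility threshold set by the user.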

      This is the standard definition of the problem of high-utility itemset mining. However, it is easy to apply this definition in many other domains. For example, you could apply high-utility itemset mining to analyse the webpages viewed by users. In that case, the utility could perhaps be defined in terms of the time spent on each page and the importance of the webpage. There can also be many other applications; a few of them are listed in the literature.

      You can find many datasets on the SPMF library webpage: http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
      Besides, on the SPMF webpage, you can also find implementations of many of the most popular utility mining algorithms.

      Now, finding a topic requires reading some papers to know what other people are working on, and you should also ask your supervisor to help you find a topic. There are a lot of possibilities. For example, no paper has yet addressed the following topics:
      – discovering rare high-utility sequential rules
      – discovering uncertain high-utility sequential rules
      – discovering uncertain sequential rules (not related to high-utility)
      – etc.
      These are just a few examples that I can think of. Actually, you can generate a new topic by combining any two existing topics, such as:
      rare itemset mining + high-utility sequential rule mining = rare high-utility sequential rule mining.

  15. DIVVELA SRINIVASA RAO says:

    Hi sir,
    What is on-shelf utility mining? Please give me some examples and applications of on-shelf utility mining.

    • On-shelf high utility mining is similar to high utility itemset mining, but we consider the shelf time of products (when the products are on sale and when they are not). For example, imagine that you have a retail store in a country where the weather is hot during the summer and cold during the winter. During the summer, the retail store may sell swimming suits and other products related to the beach. But during the winter, it may not sell swimming suits; it may sell products related to winter sports instead. Besides, there are also some products, such as milk, that are sold all year round.

      In on-shelf high-utility itemset mining, we want to take into account that some products are sold only during part of the year. The idea is that it is not fair to evaluate all the itemsets (sets of products) by calculating the profit over a whole year, because some products are sold only during 1 or 2 months while others are sold during the whole year. Thus, these products do not have the same opportunity to generate a high profit.

      The difference between high-utility itemset mining and on-shelf high utility itemset mining is that the latter considers the time periods during which items are sold, to provide a fairer measurement of the profit of itemsets. It is actually a very interesting problem, and there are several possibilities to extend it.
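
      As a rough illustration (the formal definitions are in the paper below): if swimming suits generate a 3,000$ profit but are on the shelves for only three months, while milk generates a 4,000$ profit over twelve months, comparing the raw profits is misleading. On-shelf high-utility itemset mining instead evaluates the utility of an itemset relative to the total utility of the time periods during which it was on the shelves, so that the swimming suits can be recognized as highly profitable for their shelf time.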

      You can read my paper about on-shelf high-utility-itemset mining here:
      http://www.philippe-fournier-viger.com/FOSHU_on-shelf-high-utility-itemset-mining.pdf
      It will give you all the details.

  16. DIVVELA SRINIVASA RAO says:

    Hi sir,
    I am D. Srinivasa Rao, a research scholar. Please guide me or suggest how to find a research problem in utility mining. Shall I proceed with high utility itemset mining with negative item values, or with on-shelf utility mining with negative item values? Please tell me whether this is a good direction in utility mining and, if so, how to select the problem.

    • Hi,
      I gave you a few suggestions when I replied to your other comment.

      On-shelf high-utility itemset mining with negative items is an OK topic. But there already exist some algorithms for this. So, in that case, you can either extend them further to do more things, or try to make a faster algorithm.

  17. Revathi says:

    Hi Sir,
    I am an M.Tech student doing a project in the field of data mining. I am unable to find the differences among periodic patterns, popular patterns and frequent patterns. When I read examples of each kind of pattern, they seem different, but I am not able to differentiate them when comparing them. Please give some examples of each kind of pattern and tell me where we can apply these patterns in the real world. Please also suggest some books related to pattern mining.

    • Hi,
      A periodic pattern is something that appears periodically. For example, if you analyse customer transactions in a supermarket, you may find that some people buy milk and bread with cheese every week. It is said to be periodic because it appears regularly after some period of time (e.g. every week). That is the basic idea of periodic patterns. A real-world application is to analyze customer transactions, but it could also be applied to many other types of data, such as the electricity consumption of users, etc.

      A frequent pattern is something that appears frequently in a database. For example, if you analyze customer transactions again, you may find that people have frequently bought some items together, such as salsa sauce with tortillas and beef. Frequent pattern mining has a lot of applications besides market basket analysis; there are hundreds of them, and it would take too long to list them all. Besides, there are many variations, such as finding frequent sequences, frequent subgraphs, etc.

      By the way, you can also combine different kinds of constraints to find patterns. For example, it would be possible to find frequent periodic patterns, as in the sketch below.
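
      To make the idea of periodicity concrete, here is a minimal Java sketch (illustrative code, not taken from SPMF; the names are mine), using a common definition where a pattern is periodic if no gap between consecutive transactions containing it, including the gaps with the database borders, exceeds a maximum period:

      import java.util.Arrays;
      import java.util.List;

      public class PeriodicityCheck {

          // Returns true if every period (gap between consecutive transactions
          // containing the pattern, including the gaps with the start and end
          // of the database) is no larger than maxPeriod.
          static boolean isPeriodic(List<Integer> occurrences, int databaseSize, int maxPeriod) {
              int previous = 0; // virtual position before the first transaction
              for (int tid : occurrences) {
                  if (tid - previous > maxPeriod) {
                      return false;
                  }
                  previous = tid;
              }
              // gap between the last occurrence and the end of the database
              return databaseSize - previous <= maxPeriod;
          }

          public static void main(String[] args) {
              // a pattern appearing in transactions 2, 5, 8 and 11 of a database of 12 transactions
              List<Integer> occurrences = Arrays.asList(2, 5, 8, 11);
              System.out.println(isPeriodic(occurrences, 12, 3)); // true: all gaps are <= 3
              System.out.println(isPeriodic(occurrences, 12, 2)); // false: the gap between 2 and 5 is 3
          }
      }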

      For a good book about pattern mining, you may read “Frequent Pattern Mining” by Charu Aggarwal, which gives a good overview and is quite recent.

      You may also read chapter 4 of “Introduction to Data Mining”, which is quite easy to understand, although it is a little bit old.

  18. Thuy Duong says:

    Dear Phillipe,
    I am studying sequential pattern mining from data sequences for my master’s thesis. I really need the SNAKE dataset, but it is not publicly available. Could you send me the SNAKE dataset? I would be very grateful for your support.

  19. Cindy says:

    Hello sir

    Can you help me find relevant information on ethical theories regarding the issue
    “should adultery be criminalized?”
    It is for my research assignment.

  20. Zehong Zheng says:

    Dear Professor Philippe Fournier-Viger:
    Thank you for reading!
    Recently, I ran into some problems when doing sequential pattern mining in environmental protection. I do not know how to transform a raw database into a sequence database. Can you give me some ideas based on your experience? Thank you!

  21. Saad Khan says:

    Hello Sir, I need the Java source code of the KHMC algorithm (which was developed to mine top-k high utility itemsets) for my project. Can you please help me out with this?


  24. Sirisha A says:

    Respected Sir,
    I am a research student and I use SPMF for datasets and algorithms. Recently I studied your paper titled “Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams”.

    In this paper, you mentioned that each book of an author is converted into a sequence database of POS tags by performing pre-processing using the
    RiTa NLP library (http://www.rednoise.org/rita/) and the Stanford NLP Tagger (http://nlp.stanford.edu/software/).

    The Stanford NLP Tagger has only 36 POS tags, but I found 38 POS tags in the SPMF dataset at the following link, and they also differ from the Stanford NLP Tagger’s tags:
    https://www.philippe-fournier-viger.com/spmf/datasets/pokou/spmfformat3/A%20Tale%20of%20The%20Rice%20Lake%20Plains%20by%20Catharine%20Traill.txt.txt

    The SPMF dataset has single letters as POS tags which are not found in the Stanford NLP Tagger. I have listed below the POS tags found in your dataset and in the Stanford NLP Tagger; kindly tell me the difference between them.
    POS Tags found in the SPMF dataset are :
    @ITEM=1=dt
    @ITEM=2=jj
    @ITEM=3=cc
    @ITEM=4=rb
    @ITEM=5=nn
    @ITEM=6=vbz
    @ITEM=7=vbn
    @ITEM=8=in
    @ITEM=9=nns
    @ITEM=10=wrb
    @ITEM=11=prp
    @ITEM=12=nnps
    @ITEM=13=vbd
    @ITEM=14=cd
    @ITEM=15=vbg
    @ITEM=16=to
    @ITEM=17=wp
    @ITEM=18=nnp
    @ITEM=19=vb
    @ITEM=20=md
    @ITEM=21=vbp
    @ITEM=22=wdt
    @ITEM=23=ex
    @ITEM=24=uh
    @ITEM=25=jjs
    @ITEM=26=jjr
    @ITEM=27=g
    @ITEM=28=rbs
    @ITEM=29=p
    @ITEM=30=v
    @ITEM=31=rbr
    @ITEM=32=h
    @ITEM=33=x
    @ITEM=34=k
    @ITEM=35=l
    @ITEM=36=o
    @ITEM=37=m
    @ITEM=38=n
    POS tags found in Stanford NLP Tagger
    1. CC Coordinating conjunction
    2. CD Cardinal number
    3. DT Determiner
    4. EX Existential there
    5. FW Foreign word
    6. IN Preposition or subordinating conjunction
    7. JJ Adjective
    8. JJR Adjective, comparative
    9. JJS Adjective, superlative
    10. LS List item marker
    11. MD Modal
    12. NN Noun, singular or mass
    13. NNS Noun, plural
    14. NNP Proper noun, singular
    15. NNPS Proper noun, plural
    16. PDT Predeterminer
    17. POS Possessive ending
    18. PRP Personal pronoun
    19. PRP$ Possessive pronoun
    20. RB Adverb
    21. RBR Adverb, comparative
    22. RBS Adverb, superlative
    23. RP Particle
    24. SYM Symbol
    25. TO to
    26. UH Interjection
    27. VB Verb, base form
    28. VBD Verb, past tense
    29. VBG Verb, gerund or present participle
    30. VBN Verb, past participle
    31. VBP Verb, non-3rd person singular present
    32. VBZ Verb, 3rd person singular present
    33. WDT Wh-determiner
    34. WP Wh-pronoun
    35. WP$ Possessive wh-pronoun
    36. WRB Wh-adverb

    After the POS tagging, each book has to be represented as a set of sequences of POS tags, but in the above dataset link the entire book is represented as a single sequence. I found -2 only once, at the end of the dataset (as per the SPMF format, -2 represents the end of a sequence), and I was unable to identify different sentences as different sequences.

    In order to find the top-k sequential patterns (POS tag patterns) in a book, I need to represent the book as a sequence database, but according to the link the entire book is represented as a single sequence. Is the TKS algorithm applicable in this case?

    So, kindly help me to understand how you applied the TKS algorithm to a book to find the top-k POS tag patterns. Please give me clarity in this regard, sir.

    If possible, kindly share the pre-processing code used for the above paper. You also mentioned that you modified the TKS algorithm for this paper; if so, kindly share that code as well, so that I can understand it thoroughly.

    Thank you.

    • Good afternoon,

      Thanks for reading the blog and the papers, and for using the software. I am sorry for answering a little bit late. It is currently the summer vacation and I did not check the blog for a few days.

      I see what you mean. I will check the files and look for the code and send you an e-mail to your gmail address that you used here.

      Best regards,
      Philippe

  25. Yann Eric says:

    Hello Philippe, thank you for the content of your blog and also of your YouTube channel;
    it is a great pleasure to consult them.

    I am currently rewriting an algorithm myself to obtain association rules,
    and I am stuck on a question of choice.

    I must choose between giving the algorithm all the orders, or only the orders where the customer bought more than 2 products.
    I lose information about the support of the most important products if I take the second option, but I am still hesitating.
    Do you have any leads for me?

    I have no resource constraints, because I am writing SQL on Google’s cloud (BigQuery).

    • Hello Yann,

      I apologize for the delayed reply. I am currently on vacation and, despite that, a bit overloaded with work, so I have not looked at the blog comments for some time. Thank you for following the blog and the YouTube channel. It is much appreciated!

      Now, to answer the question, I am not sure that one choice is better than the other, but we can discuss the result.

      Approach 1:
      – The support will be computed exactly.
      – Transactions with only one or two items are perhaps not that interesting. It depends on the application. If these are purchases in a store, keeping them may seem reasonable, but it would be worth looking at the results this gives from a practical point of view and analyzing them to verify.

      Approach 2:
      – Removing the transactions with 1 or 2 items should speed up the algorithm. However, as you tell me, performance is perhaps not a priority for you, so why remove these transactions? Maybe I did not fully understand; I think the main advantage would be to improve performance.

      Now, I would like to speak a bit more generally. In general, for an application such as analyzing what consumers buy in a store, it may not be necessary to have association rules containing many items, and perhaps rules with only one item in the consequent would be sufficient for what you are doing. It actually depends a bit on your goal, that is…

  26. Jean-Francois Bernier says:

    Hello! Thanks for your numerous contributions to this field of data science! After reading many papers on the topic, I still can’t wrap my head around the difference between sequence mining and episode mining. The two approaches are strikingly similar at a bird’s-eye view: they seem to ingest roughly the same kind of sequences as input and have very similar parameters and functionalities (like windowing), and yet the interpretation of the results seems to differ slightly. I was wondering if you could clarify the differences between these two types of problems, and in which context one would be preferable over the other. Thank you!

    • Good afternoon,

      Thanks for your comment. Yes, those two topics are indeed very similar. The main difference is that in episode mining the input is a single sequence that is very long, while in sequential pattern mining, the input is multiple sequences (a sequence database).

      Then, in episode mining, the goal is to find subsequences that appear many times in the long sequence.
      And, in sequential pattern mining, the goal is to find some subsequences that are common to many sequences.

      For example, let’s say that we have a very long sequence of locations visited by a person in a city. We could apply episode mining to find some sequences of locations often visited by that person.

      Now let’s say that you have the sequences of locations visited by hundreds of people in a city. Then you could apply sequential pattern mining to find some sequences of locations that are common to many persons.
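
      In terms of data, the difference looks like this (a toy illustration): episode mining would take the single long sequence ⟨a, b, a, c, a, b, d, a, b⟩ and find that the episode “a followed by b” occurs many times within it, while sequential pattern mining would take a database of several sequences such as ⟨a, b, c⟩, ⟨a, d, b⟩ and ⟨a, b, e⟩ and find that “a followed by b” is common to all three sequences.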

      So that is the main difference between those two tasks! Hope it is clearer. Thanks for reading the blog!

  27. K. P. Birla says:

    Sir,
    Thank you for providing wonderful resources on SPMF to study HUPM.
    Could you please provide an updated version of the map figure at https://www.philippe-fournier-viger.com/spmf/map_algorithms_spmf_data_mining097.png

    • Thanks for your comment! I am currently working on the new version of SPMF and making several improvements that will be released soon!

      I appreciate your suggestion and will try to update the map soon as well. Best regards,

      • K. P. Birla says:

        Thank you, sir, for your response! I am eagerly looking forward to the new version of the updated map. I have another query: HUPM algorithms take the value of the “minUtil” threshold (that is, delta) as an input given by the user, and the item-wise external utility values are also assumed to be given. It is observed from published articles that the values for the “minUtil” threshold are in the range of 0.0001 to 0.0005, and that the ranges differ across articles for the same experimental dataset. How much is 0.0001, and with respect to what? Why so? Is there any rule of thumb for deciding those ranges? Also, if no item- or transaction-wise utility values are provided in some datasets, then how do we carry out experiments?

          • Good morning, that is a good question. Let me explain.
          To evaluate a high utility pattern mining algorithm in terms of performance, authors will do some experiments where the “minutil” parameter is varied to see how it influences the performance (runtime, and sometimes also memory).
          The goal of such experiments is to compare a few algorithms to determine which one is better, and in which situations.
          Generally, the performance of HUPM algorithms will decrease as the “minutil” parameter is decreased. More specifically, as “minutil” is decreased, the search space can increase exponentially, and the runtime can also increase very quickly.
          So, because of that, researchers will typically not decrease “minutil” until it reaches 0.
          Instead, researchers will choose a range of values where the algorithms can run in a reasonable amount of time and such that we can clearly see the performance difference between the compared algorithms.
          That range of values will depend on the dataset characteristics (is it dense or sparse, does it contain short or long transactions, etc.) because these characteristics will greatly influence the runtime.
          In fact, some algorithm may take hours to run for some high minutil value on one dataset, while it may take seconds for the same value on another dataset, just because of the characteristics of the datasets.
          So, because of that, the ranges of values do not need to be the same on all datasets.
          Now, as you have noticed, different authors may use different ranges of values for the same datasets. That can be OK. But generally, if you propose a new algorithm and compare it with a previous algorithm, you should use a range that is at least the same or larger, to provide a fair experiment. For example, let’s say that the authors of an algorithm A used the range 0.03 – 0.04 on a dataset X. Then, if you propose a new algorithm Y and compare it with algorithm A, you should use the same range, or a larger range like 0.01 – 0.04. Then the comparison is fair. But if you use a range like 0.04 – 0.05, then the comparison is not fair, because you did not include the range used by the authors of algorithm A, and reviewers may have suspicions about why you did that.
          So, to summarize:
          (1) If you are comparing with an existing algorithm, you should use the same range or a range that is larger.
          (2) You do not need to use the same ranges on all datasets. What is important is that you choose a range that can clearly show a difference between the algorithms, if there is one.
          (3) Generally, if you are doing performance experiments, you need to try to push the algorithms to their limit. I always recommend doing experiments where algorithms run for at least 50 seconds. If you do experiments where algorithms terminate in a very short time like 0.1 seconds, it will not be very convincing.

          Hope that it helps

          • Also, when you write a paper, you should clearly explain how you conducted the experiments and how you selected the range of values.

            For example, you can say that for the experiments, you have selected ranges of values that could clearly display the differences between the compared algorithms and that you have used ranges that are the same as in previous papers. You could also say that you have run the algorithm with a maximum time limit (e.g. 1000 seconds) and that if algorithms exceeded that time limit, you did not include the results of the experiments.

  28. K. P. Birla says:

    Thank you sir for replying with detailed explanations.

    Referring to https://www.philippe-fournier-viger.com/spmf/map_algorithms_spmf_data_mining097.png

    Dear Sir, can you please let me know the last update (the latest paper) that was considered and included in the map?
    After ‘Cori’, what are the further developments in the area of highly correlated itemset mining?
    Please also give your valuable comments on recent developments in correlation measures. What are the criteria for differentiating and choosing among the measures?
    Thank you.

  29. K. P. Birla says:

    “5.1 Dataset and Experimental Setup:
    …It is important to notice that both the FHM and FDHUP algorithms are varied by one parameter minUtil, while the CoHUIM algorithm discovers the CoHUIs by using both correlation and utility constraints. Therefore, experiments are conducted on datasets by varying minUtil. In addition, the minCor is adjusted with five times on each dataset to verify the efficiency of the proposed CoHUIM algorithm and respectively denoted them as CoHUIMminCor1, CoHUIMminCor2, CoHUIMminCor3, CoHUIMminCor4, and CoHUIMminCor5. Specifically, the different five minCor thresholds are respectively set as: (a) in foodmart, 0.01, 0.02, 0.03, 0.04, 0.05; (b) in retail, 0.10, 0.15, 0.20, 0.25, 0.30; (c) in chess, 0.80, 0.81, 0.82, 0.83, 0.84;” – Wensheng Gan et al., “Extracting Non-redundant Correlated Purchase Behaviors by Utility Measure”, 2017, DOI: 10.1016/j.knosys.2017.12.003

    Table 4: Number of patterns when minUtil is varied in foodmart:

    minUtil:        0.006% | 0.007% | 0.008% …
    HUIs:           93418  | 49821  | 26176
    CoHUIsminCor1:  93418  | 49821  | 26176
    CoHUIsminCor2:  24857  | 17127  | 12026
    …

    Referring to the above text from the cited article: what are the actual values of minUtil and of CoHUIMminCor1, CoHUIMminCor2?
    How are minUtil and CoHUIMminCor1, CoHUIMminCor2 calculated?

    • When minUtil is expressed as a percentage, it usually means a percentage of the total utility of the database.
      For example, if the sum of all utility values in a database is 1,000,000, then minutil = 1 % means that minutil = 1% * 1,000,000 = 10,000.

      As for minCor, you seem to have mentioned the values in your comment:
      as: (a) in foodmart, 0.01, 0.02, 0.03, 0.04, 0.05; (b) in retail, 0.10, 0.15, 0.20, 0.25, 0.30; (c)
      in chess, 0.80, 0.810, 0.82, 0.83, 0.84;”…

      I did not read the paper; I am just answering based on the information you provided.

      • K. P. Birla says:

        Thank you very much sir.
        Your insightful videos on pattern mining have been immensely beneficial.
        I am awaiting additional videos covering the following topics:
        1) Exploring algorithm performance analysis across balanced and imbalanced datasets, as well as sparse and dense datasets.
        2) Strategies for selecting appropriate data structures and pruning techniques in the above-mentioned contexts.
        3) While many authors emphasize runtime and memory usage in their studies, how can one effectively validate and compare the accuracy of the extracted patterns?
