Report of the PAKDD 2014 conference (part 2)

This post continues my report of the PAKDD 2014 conference in Tainan (Taiwan).


About big data

Another interesting talk at this conference was given by Jian Pei. The topic was Big Data.

A key idea of this talk was that to make a technology useful, you have to make it small and invisible. A system relying on data mining may have to detect when a user needs a data mining service and provide that service as early as possible.

Another desirable characteristic of a data mining system is that users should be able to set preferences. Moreover, if a user interactively changes these preferences, the results should be updated quickly. A data mining system should also be context-aware.

It was also mentioned that big data is always relative. Some papers in the 1970s were already talking about large data, and recently some conferences have even adopted the theme “extremely large databases”. But even if “big” is relative, since 2003 we record more data every few days in the world than everything that had been recorded before.

Social activities and organization

In general, PAKDD was very well organized. The organizers did a huge job. Personally, it is one of the best conferences that I have attended in terms of organization. I was also able to meet many interesting people from the field of data mining that I had not met before.

The social activities and banquet were also nice.

Location of PAKDD 2015

The location of PAKDD 2015 was announced. It will be in Ho Chi Minh City, Vietnam, from 19 to 22 May 2015.  The website is

The deadline for paper submission is 3 October 2014 and the notification date is 26 December 2014.

Continue reading my PAKDD 2014 report (part 1) here or part 3 here


That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.

Report of the PAKDD 2014 conference (part 1)

I am currently at the PAKDD 2014 conference in Tainan. In this post, I will report interesting information about the conference and the talks that I have attended.


Importance of Succinct Data Structures for Data Mining

I attended a very nice tutorial by R. Raman about succinct data structures for data mining. I will try to report some main points of this tutorial here as faithfully as possible (but it is my interpretation).

A simple definition of a succinct data structure is as follows: it is a data structure that uses less memory than a naive data structure for storing the same information.

Why should we care?

  • One reason is that to perform fast data mining, it is usually important to have all the data in memory. But sometimes the data cannot fit. In the age of big data, if we use very compact data structures, then we can fit more data into memory and perhaps we do not need distributed algorithms to handle big data. An example that was provided in the tutorial is a paper by Cammy & Zhao that used a single computer with a compact structure to beat a distributed MapReduce implementation on the same task. If we can fit all the data into the memory of a single computer, the performance may be better because data access is faster on a single computer than when the computation is distributed.
  • A second reason is that if a data structure is more compact, then in some cases a computer may hold more of the data in its cache, and therefore access to the data may even be faster. Therefore, compressing data with a succinct data structure does not always have a negative effect on execution time.

What characteristics should a compressed data structure provide?

  • One important characteristic is that it should compress information, and an algorithm using the data structure should ideally be able to work directly on it without decompressing the data.
  • Another desirable characteristic is that it should provide the same interface as an uncompressed data structure. In other words, for an algorithm, we should be able to replace the data structure by a compressed data structure without having to modify the algorithm.
  • A compressed data structure is usually composed of data and an index for quick access to the data.  The index should be smaller than the data.
  • Sometimes a trade-off is redundancy in the data structure vs query time. Reducing redundancy may increase query time.

There exist various measures to assess how many bits are necessary to encode some information: naive, information-theoretic, entropy…  If we design a succinct data structure and we use more memory than what is necessary according to these measures, then we are doing something wrong.

In the tutorial, it was also mentioned that there exist several libraries providing implementations of succinct data structures, such as Sux4J, SDSL-lite, SDSL…

Many examples of succinct data structures were also provided, such as binary trees implemented as bit vectors, multibit trees, wavelet trees, etc.
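To make the idea of “data plus a small index” more concrete, here is a minimal sketch in Java of a bit vector that answers rank queries (how many 1-bits appear before a given position) without scanning all the data. It is only an illustration written for this post, not code from the tutorial or from the libraries mentioned above: the class name RankBitVector is hypothetical, and real succinct bit vectors use more refined indexes with a much smaller overhead.

```java
final class RankBitVector {
    private final long[] words;      // the data: 64 bits packed in each word
    private final int[] prefixOnes;  // the index: number of 1-bits before each word

    RankBitVector(boolean[] bits) {
        words = new long[(bits.length + 63) / 64];
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) {
                words[i >>> 6] |= 1L << (i & 63);
            }
        }
        // one counter per word, so the index is smaller than the data
        prefixOnes = new int[words.length + 1];
        for (int w = 0; w < words.length; w++) {
            prefixOnes[w + 1] = prefixOnes[w] + Long.bitCount(words[w]);
        }
    }

    /** Returns the number of 1-bits in positions [0, i). */
    int rank1(int i) {
        int word = i >>> 6;
        int offset = i & 63;
        int count = prefixOnes[word];
        if (offset != 0) {
            long mask = (1L << offset) - 1; // keeps only the bits before position i
            count += Long.bitCount(words[word] & mask);
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[] bits = {true, false, true, true, false, false, true};
        RankBitVector vector = new RankBitVector(bits);
        // number of 1-bits among the first four positions
        System.out.println(vector.rank1(4));
    }
}
```

A rank query reads one counter and one word, whatever the size of the vector, which is the kind of operation that lets an algorithm work directly on the compact representation.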

On applications of association rule mining

Another very interesting talk was given by G. Webb. The talk first compared association rule mining with methods from the field of statistics to study associations in data. It was explained that:

  • statistics often tries to find a single model that fits the data, whereas association rule mining discovers multiple local models (associations) and lets the user choose the best ones (the rules that better explain the data).
  • association rule mining is scalable to high-dimensional data, whereas classical techniques from statistics cannot be applied to a large number of variables.

So why is association rule mining not used more in real applications? It was argued that the reason is that researchers in this field focus too much on performance (speed, memory) rather than on developing algorithms that can find unusual and important patterns. By focusing only on finding frequent rules, too much “junk” is presented to the user (frequent rules that are obvious). It was shown that in some applications, it is not always the frequent rules that are important, but the rare ones that have a high statistical significance or are important to the user.

So what is important to the user? It is a little bit subjective. However, there are at least four principles that can help to know what is NOT important to the user.

  • 1) If the frequency of an association can be predicted by assuming independence, then the association is not important. For example, finding that all persons having prostate cancer are males is an uninteresting association, because it is obvious that only males can get prostate cancer.
  • 2) Redundant associations should not be presented to the user. If an item X is a necessary consequence of a set of items Y, then {X} ∪ Y will be associated with everything that Y is associated with. We don’t need all these rules. In general, we should keep either the simple or the complex rules (we should remove redundant rules).
  • 3) Statistical tests should be applied to filter out non-significant associations.
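The first principle can be illustrated with a small calculation. The sketch below (in Java, with hypothetical counts) compares the observed support of an association with the support that would be expected if its two parts were independent. The ratio of the two is the well-known lift measure; the class and method names are mine and are not taken from the talk.

```java
final class IndependenceCheck {

    /** Expected number of transactions containing both X and Y if X and Y were independent. */
    static double expectedSupport(int supportX, int supportY, int transactionCount) {
        return (double) supportX * supportY / transactionCount;
    }

    /** Lift: observed joint support divided by the support expected under independence. */
    static double lift(int supportXY, int supportX, int supportY, int transactionCount) {
        return supportXY / expectedSupport(supportX, supportY, transactionCount);
    }

    public static void main(String[] args) {
        int n = 1000;        // hypothetical number of transactions
        int supportX = 200;  // hypothetical support of X
        int supportY = 500;  // hypothetical support of Y

        // Observed 100 times: exactly what independence predicts, so the rule tells us nothing new.
        System.out.println(lift(100, supportX, supportY, n));
        // Observed 180 times: clearly more often than independence predicts.
        System.out.println(lift(180, supportX, supportY, n));
    }
}
```

A lift close to 1 means that the frequency of the association is what independence would predict, so by the first principle it is not worth showing to the user. A statistical test (the third principle) would still be needed to decide whether a deviation from 1 is significant.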

Also, it is desirable to mine associations efficiently and to be able to explain to the user why some rules are eliminated, if necessary.

Also, if possible we may use top-k algorithms where the user chooses the number of patterns to be found rather than using the minsup threshold. The reason is that sometimes the best associations are rare associations.
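Here is a minimal sketch in Java of the top-k idea: the user only provides k, and the threshold is raised internally as better patterns are found. It is a simplification under strong assumptions: the patterns and their supports are given in advance in a map and ranked by support only, whereas a real top-k algorithm generates patterns during the search and uses the internal threshold to prune the search space. The names TopKPatterns and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKPatterns {

    /** Returns the k patterns with the highest support, from highest to lowest. */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> supports, int k) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        Comparator<Map.Entry<String, Integer>> bySupport = Map.Entry.comparingByValue();
        // min-heap: the pattern with the lowest support among the current best is on top
        PriorityQueue<Map.Entry<String, Integer>> best = new PriorityQueue<>(bySupport);
        int internalMinsup = 0; // raised automatically once k patterns have been found

        for (Map.Entry<String, Integer> pattern : supports.entrySet()) {
            if (pattern.getValue() < internalMinsup) {
                continue; // cannot be among the k best patterns
            }
            best.add(pattern);
            if (best.size() > k) {
                best.poll(); // drop the pattern with the lowest support
            }
            if (best.size() == k) {
                internalMinsup = best.peek().getValue();
            }
        }
        result.addAll(best);
        result.sort(bySupport.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> supports = new HashMap<>();
        supports.put("a => b", 5);
        supports.put("a => c", 9);
        supports.put("b => c", 2);
        supports.put("c => d", 7);
        System.out.println(topK(supports, 2));
    }
}
```

The user never has to guess a good minsup value: the value of internalMinsup at the end is simply the support of the k-th best pattern.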

These were the main ideas that I noted in this presentation.

Continue reading my PAKDD 2014 report (part 2) here



New version of SPMF Java open-source data mining library (0.95)

Today, I am writing a post to announce a new version of the SPMF Java open-source data mining library. It is SPMF version 0.95, and it is a major revision. It offers 11 new data mining algorithms for various data mining tasks, including several sequential pattern mining algorithms.

The list of the new algorithms is as follows:

  • TKS for top-k sequential pattern mining
  • TSP for top-k sequential pattern mining
  • VMSP for maximal sequential pattern mining
  • MaxSP for maximal sequential pattern mining
  • estDec algorithm for mining recent frequent itemsets from a stream (by Azadeh Soltani)
  • MEIT (Memory Efficient Itemset-Tree), a data structure for targeted association rule mining
  • CM-SPAM for sequential pattern mining
  • CM-SPADE for sequential pattern mining
  • CM-ClaSP for closed sequential pattern mining
  • PASCAL for mining frequent itemsets and identifying generators

Below, I will briefly explain the importance of these new algorithms.

Algorithms for maximal sequential pattern mining

Mining sequential patterns can often produce a lot of sequential patterns. To address this issue, several algorithms have been proposed to mine fewer but more representative sequential patterns. Some popular subsets of sequential patterns are closed and maximal sequential patterns. In previous versions of SPMF, there were algorithms for closed sequential pattern mining but no algorithms for maximal sequential pattern mining. Now, two recent state-of-the-art algorithms have been added for this task: VMSP (2014) and MaxSP (2013).

It is interesting to mine maximal sequential patterns because it can produce up to an order of magnitude fewer patterns than mining closed sequential patterns.
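To illustrate what “maximal” means, recall that a frequent sequential pattern is maximal if it is not strictly included in another frequent sequential pattern. The sketch below, in Java, applies this definition as a naive filter over a list of frequent patterns that have already been found. It is not how VMSP or MaxSP work (they avoid producing all the patterns first), and it assumes that each itemset contains a single item. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MaximalPatternFilter {

    /** Returns true if 'small' can be obtained from 'large' by deleting items (order is preserved). */
    static boolean isSubsequence(List<Integer> small, List<Integer> large) {
        int matched = 0;
        for (int item : large) {
            if (matched < small.size() && small.get(matched) == item) {
                matched++;
            }
        }
        return matched == small.size();
    }

    /** Keeps only the patterns that are not strictly included in another pattern of the list. */
    static List<List<Integer>> keepMaximal(List<List<Integer>> frequentPatterns) {
        List<List<Integer>> maximal = new ArrayList<>();
        for (List<Integer> pattern : frequentPatterns) {
            boolean hasFrequentSuperPattern = false;
            for (List<Integer> other : frequentPatterns) {
                if (other.size() > pattern.size() && isSubsequence(pattern, other)) {
                    hasFrequentSuperPattern = true;
                    break;
                }
            }
            if (!hasFrequentSuperPattern) {
                maximal.add(pattern);
            }
        }
        return maximal;
    }

    public static void main(String[] args) {
        List<List<Integer>> frequent = Arrays.asList(
                Arrays.asList(147),
                Arrays.asList(147, 59),
                Arrays.asList(147, 57),
                Arrays.asList(155, 44, 21));
        // <147> is dropped because it is included in <147, 59>
        System.out.println(keepMaximal(frequent));
    }
}
```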

Faster algorithms for sequential pattern mining

The new version also offers faster algorithms for sequential pattern mining and closed sequential pattern mining.

CM-SPADE (2014) outperforms all the previous algorithms in SPMF in most cases.  CM-SPAM (2014) offers the second best performance most of the time.

Furthermore, the CM-ClaSP algorithm (2014) outperforms the CloSpan, ClaSP and BIDE+ algorithms for closed sequential pattern mining.

A performance comparison on six public datasets, showing the speed improvement obtained with these new algorithms, is available on the “performance” page of the SPMF website.

New algorithm for mining frequent itemsets from a stream

A new algorithm for stream mining is also introduced. It is the estDec algorithm (2003). It is a classical algorithm for mining frequent itemsets from a stream that puts more importance on recent transactions. Previously, there was only one stream mining algorithm in SPMF, named CloStream. But CloStream has some important limitations: (1) it does not use a minimum support threshold, so it can find a huge number of patterns, and (2) it puts as much importance on old transactions as on recent transactions from the stream. estDec addresses these issues by using a threshold and putting less importance on older transactions.
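The idea of putting less importance on older transactions can be sketched with a decayed counter, as below in Java. This is only an illustration of the principle, not the estDec algorithm itself: the real algorithm also has to decide which itemsets to monitor, which is not shown here. The class name and the decay factor of 0.5 are assumptions chosen to make the effect visible.

```java
final class DecayedCounter {
    private final double decay; // hypothetical decay factor in (0, 1]; 1 means no decay
    private double count;       // decayed number of transactions containing the itemset
    private double total;       // decayed number of transactions seen so far

    DecayedCounter(double decay) {
        this.decay = decay;
    }

    /** Processes one new transaction of the stream. */
    void observe(boolean containsItemset) {
        // everything seen before loses some weight, the new transaction has weight 1
        total = total * decay + 1.0;
        count = count * decay + (containsItemset ? 1.0 : 0.0);
    }

    /** Decayed support of the itemset, between 0 and 1. */
    double support() {
        return total == 0.0 ? 0.0 : count / total;
    }

    public static void main(String[] args) {
        DecayedCounter counter = new DecayedCounter(0.5);
        // the itemset appears in the two oldest transactions but not in the two most recent ones
        counter.observe(true);
        counter.observe(true);
        counter.observe(false);
        counter.observe(false);
        // lower than the plain support of 2/4, because the occurrences are old
        System.out.println(counter.support());
    }
}
```

With a decay factor of 1, the counter gives the usual support, where old and recent transactions have the same importance.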

New algorithm for frequent itemset mining

An implementation of the PASCAL algorithm is introduced for frequent itemset mining. PASCAL is a classical algorithm based on Apriori. The main improvement in PASCAL over Apriori is that it can correctly guess the support of some itemsets directly without having to scan the database.  This is done based on the following property from lattice theory: when an itemset is not a “generator”, its support is the minimum support of its subsets.  I will not define the concept of generator here. If you are curious, you can have a look at the example for Pascal in the documentation of SPMF for more details.
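The property above can be sketched in a few lines of Java. The code below only shows the inference step: given an itemset that is already known not to be a generator, it returns the minimum support of its subsets obtained by removing one item. Deciding whether an itemset is a generator is the job of the algorithm and is not shown. The class name, method name and support values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SupportInference {

    /** Minimum support among the subsets of 'itemset' obtained by removing one item. */
    static int minSupportOfSubsets(Set<String> itemset, Map<Set<String>, Integer> knownSupports) {
        int min = Integer.MAX_VALUE;
        for (String item : itemset) {
            Set<String> subset = new HashSet<>(itemset);
            subset.remove(item);
            Integer support = knownSupports.get(subset);
            if (support == null) {
                throw new IllegalArgumentException("The support of " + subset + " is unknown");
            }
            min = Math.min(min, support);
        }
        return min;
    }

    public static void main(String[] args) {
        // hypothetical supports already counted in the database
        Map<Set<String>, Integer> supports = new HashMap<>();
        supports.put(new HashSet<>(Arrays.asList("a")), 4);
        supports.put(new HashSet<>(Arrays.asList("b")), 3);

        // assuming that {a, b} is not a generator, its support is inferred without a database scan
        Set<String> ab = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(minSupportOfSubsets(ab, supports));
    }
}
```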

New algorithm for targeted itemset and association rule mining

A new data structure named the Memory Efficient Itemset Tree (MEIT) is also introduced. This structure is an improvement of the Itemset-Tree structure that can use up to 45 % less memory but is a little bit slower (there is a trade-off between memory and speed). If you don’t know what an Itemset-Tree is, it is a structure for performing queries about itemsets and association rules. You can imagine, for example, a program that has to perform many queries such as “give me all the association rules with itemset X in the consequent”, “give me all the rules with item Z in the consequent”, etc. An Itemset-Tree is a structure that is optimized for answering such queries quickly and that can be updated incrementally by inserting new transactions. If you are curious about this, you can have a look at the examples for the Memory Efficient Itemset-Tree and the Itemset-Tree in the documentation of SPMF for more details.

Bug fixes and other improvements

Lastly, this new version also includes a few bug fixes. One notable fix is for a bug in the ID3 implementation. I have also improved the documentation to explain clearly the input and output formats of each algorithm, and I have updated the map of data mining algorithms offered in SPMF.

That is all I wanted to write for today! I hope that you will enjoy this new version of SPMF!


How to get citations for your research papers?

Today, I will present some tips about how to write research papers that get many citations. Obviously, some of these tips may depend on the specific topics that you are working on, and there are also some exceptions to this advice.



  • Publish solutions to popular problems. If you are working on a topic that is popular, it is more likely to have an impact than if you are working on a topic that is not popular.
  • Give enough details about your method. If your paper is detailed enough so that someone can implement your work by reading your paper or reproduce your experiments, then it is more likely that someone else will use your work.
  • Publish a solution that is generalizable to multiple applications. If you propose a solution to a problem that has very few applications, then fewer people may reuse or extend your work. But if you publish something that is reusable in many situations, then chances are probably higher that you will get cited.
  • Write your title, abstract and introduction well. Even if you do some good research, if your paper is badly written, it will have less chance of getting cited. The title, keywords and abstract should be carefully selected so that people who only read the title and abstract will know what your paper is about.
  • Publish in good conferences and journals. Citing articles published in excellent journals and conferences is generally good in a paper. Therefore, if you publish in an excellent conference or journal, your paper has more chance of being cited. Moreover, excellent conferences and journals will give more visibility to your papers. Besides, it is generally preferable to publish in a conference that is closely related to the topic of your paper, rather than publishing in a conference that is unrelated or is too general.
  • Put your papers online! Nowadays, most researchers search on Google to find articles instead of going to the library. Therefore, it is important that your articles are freely accessible on the Web. A good way to do this is to create a webpage and publish your PDFs on your website (note that it is important to check the copyright agreement that you have signed with your publisher to see if you have the right to put the PDFs on your website). You may also use researcher social networks such as ResearchGate to publish your papers, or archival repositories such as arXiv.
  • Sometimes cite your own papers. It is a good idea to sometimes cite a few of your own papers. Obviously, one should not cite too many of one’s own papers in a paper, but citing a few is generally OK. In some search engines like Google Scholar, the number of citations influences the order of the results. If your papers have citations (even self-citations), then they are likely to appear before papers that have no citations in the results.
  • Write a survey paper. Survey papers usually get a lot of citations because many people read them and it is useful to cite survey papers in an article. Therefore, writing a survey paper may bring you a lot of citations.
  • Mention your papers on social media or blogs. You may use Twitter, Facebook, Google Plus and other social media platforms to talk about your research. Another good idea to promote your work is to write a blog post about your research or to convince someone else to write a blog post about it.
  • Give talks at universities or conferences.  Another good way to promote your research is to give invited talks at other universities and at conferences.
  • If you are directly extending the work of someone else, you may send them your paper by e-mail, as they may be interested in your work.
  • Build a reputation for publishing excellent work. As you become more famous, people will start to search for your name to find your other papers, or sometimes look at your website.
  • Publish your datasets.  If you publish your datasets, other researchers may reuse them, and if they reuse them, they will cite you.
  • Publish your source code.  If you are providing your source code or binaries for a computer program,  other researchers may choose to use your work instead of the work of someone else simply because they will save time by not having to implement the work of someone else.

These are just a few tips. There are certainly more things that could be said on this topic, but it is enough for today. I hope that you have enjoyed this post.

Discovering and visualizing sequential patterns in web log data using SPMF and GraphViz

Today, I will show how to use the open-source SPMF data mining software to discover sequential patterns in web log data. Then, I will show how to visualize the frequent sequential patterns found, using GraphViz.

Step 1: Getting the data.

The first step is to get some data.  To discover sequential patterns, we will use the SPMF software. Therefore the data has to be in SPMF format.  In this blog post, I will just use the web log data from the FIFA world cup from the datasets webpage of the SPMF website. It is already in SPMF format.

Step 2: Extracting sequential patterns.

Then using the SPMF.jar file downloaded from the SPMF website, I have applied the PrefixSpan algorithm to discover frequent sequences of webpages visited by the users.  I have set  the minsup  parameter of PrefixSpan to 0.15, which means that a sequence of webpages is frequent if it is visited by at least 15% of the users.

The result is a text file containing 5123 sequential patterns. For example, here are four patterns from the output file:

155 -1 44 -1 21 -1 #SUP: 4528
147 -1 #SUP: 7787
147 -1 59 -1 #SUP: 6070
147 -1 57 -1 #SUP: 6101

The first one represents users visiting the webpage 155, then visiting webpage 44 and then 21.  The line indicates that this pattern has a support of 4528, which means that this pattern is shared by 4528 users.
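If you want to process such lines in your own programs, the format can be read with a few lines of Java. The following is a minimal sketch that assumes, like the patterns shown above, that each itemset contains a single item, that “-1” separates itemsets and that the support follows “#SUP:”. The class name SpmfPatternLine is hypothetical and is not part of SPMF.

```java
import java.util.ArrayList;
import java.util.List;

final class SpmfPatternLine {
    final List<String> items = new ArrayList<>(); // the webpages, in the order they are visited
    final int support;                            // the number of users sharing the pattern

    SpmfPatternLine(String line) {
        // the part before "#SUP:" is the pattern, the part after is the support
        String[] parts = line.trim().split("#SUP:");
        for (String token : parts[0].trim().split("\\s+")) {
            if (!token.equals("-1") && !token.isEmpty()) { // "-1" marks the end of an itemset
                items.add(token);
            }
        }
        support = Integer.parseInt(parts[1].trim());
    }

    public static void main(String[] args) {
        SpmfPatternLine pattern = new SpmfPatternLine("155 -1 44 -1 21 -1 #SUP: 4528");
        System.out.println(pattern.items + " is shared by " + pattern.support + " users");
    }
}
```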

The result may be hard to understand in a text file, so we will next visualize them using GraphViz.

Step 3: Transforming the output file into GraphViz DOT format.

I use a very simple piece of Java code to transform the sequential patterns found by SPMF to the GraphViz DOT format.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

/**
 * Converts sequential patterns found by SPMF to the GraphViz DOT format.
 * @author Philippe Fournier-Viger, 2014
 */
public class MainDatasetConvertSequentialPatternsToDotFormat {

    public static void main(String [] arg) throws IOException, InterruptedException{

        // input: the file of patterns produced by SPMF
        String input = "C:\\patterns\\test.txt";
        // output: where the DOT file is written (the path must point to a file)
        String output = "C:\\patterns\\";

        BufferedReader myInput = new BufferedReader(new InputStreamReader( new FileInputStream(new File(input))));

        // map to remember arrow already seen
        // (a Map keeps only the last arrow seen leaving each item)
        Map<String, String>  mapEdges = new HashMap<String, String>();

        // for each line (pattern) until the end of file
        String thisLine;
        while ((thisLine = myInput.readLine()) != null) {
            thisLine = thisLine.trim();
            if (thisLine.isEmpty()) {
                continue; // skip empty lines
            }

            // split the pattern according to the " " separator
            String split[] = thisLine.split(" +");

            // the item that precedes the current item in the pattern
            String previousItemFromSameItemset = null;

            // for each token
            for(String token : split) {
                if("-1".equals(token)) { // if end of an itemset
                    // nothing to do
                }else if("-2".equals(token) || '#' == token.charAt(0)){ // if end of sequence
                    previousItemFromSameItemset = null;
                    break; // what follows is the support, not an item
                }else { // if an item
                    if(previousItemFromSameItemset != null) {
                        mapEdges.put(previousItemFromSameItemset, token);
                    }
                    previousItemFromSameItemset = token;
                }
            }
        }
        myInput.close();

        // write the arrows as a directed graph in DOT format
        BufferedWriter writer = new BufferedWriter(new FileWriter(output));     
        writer.write("digraph mygraph{");
        for(Entry<String, String> edge : mapEdges.entrySet()) {
            writer.write(edge.getKey() + " -> " +edge.getValue() + " \n");
        }
        // Note: only sequential patterns of size >=2 are used to create the graph and 
        // patterns are assumed to have only one item per itemset.
        writer.write("}");
        writer.close();
    }
}


Step 4: Generating a graph using GraphViz

Then I installed GraphViz on my computer running Windows 7. GraphViz is a great piece of software for the visualization of graphs and it is not very hard to use. The idea is that you feed GraphViz a text file describing a graph and it automatically draws it. Of course, there are many options that can be selected, and here I just use the basic ones.

I use the command:  dot -Tpng > output.png

This converts my DOT file to a graph in PNG format.

[Figure: the sequential patterns drawn as a graph by GraphViz]

The result is pretty interesting. It summarizes the 5123 patterns into an easy-to-understand graph. By looking at the graph, we can easily see that many sequential patterns pass through the nodes 44 and 117, which must be very important webpages on the website.

Note that using this technique, we lose a little bit of information. For example, the support of edges is not indicated on this graph. It may be possible to improve this visualization, for example by using various colors to indicate the support or the number of patterns that pass through a node.

Hope that you have enjoyed this post.

What to do when your conference paper gets rejected?

Today, I will discuss what to do when a paper that you have submitted to a conference gets rejected.


When papers are submitted to a conference, many of them generally get rejected. This is especially true for competitive conferences, where fewer than 1/4 of the papers get accepted, or sometimes even fewer than 1/10.

When your paper gets rejected, it is easy to take it personally and think that your research is not good or that you do not deserve to be published. It is also easy to blame the conference or the reviewers. However, a better attitude is to try to understand the reasons why your paper got rejected, and then to think about what you can do to avoid the problems that led to the rejection, so that you can submit the paper somewhere else and get it accepted.

First, I often hear the complaint that a paper got rejected because the reviewers are not specialists and did not understand the paper. Well, it may be true. Sometimes, by bad luck, a paper gets assigned to a reviewer who is not a specialist in your domain, or a reviewer does not have enough time to read all the details of your paper. This can happen. But often the real reasons are:

  1. The paper was submitted to a conference that was too broadly related to the topic of the paper. For example, if you submit a paper about a data mining algorithm to a general computer science or artificial intelligence conference, it is possible that no data mining expert will read your paper.  Choosing a conference is a strategic decision that should not be taken lightly when planning to submit a paper. A good way to choose a conference is to look at where papers similar to your topic have been published. This will give you a good idea about conferences that may be more “friendly” toward your research topic.
  2. Another possible reason is that your paper did not clearly highlight what your contributions are in the introduction. If the contributions of your paper are not clearly explained in the introduction, then the reviewer will have to guess what they are. From my experience, the top three parts that need to be well written in a paper are: (1) the introduction, (2) the experimental results and (3) the conclusion. I have discussed with some top researchers and they have told me that they often first just look at these three parts. Then, if the paper looks original and good enough, they may also look at the method section of your paper. For this reason, the introduction and conclusion should be very clear about what your contributions are.
  3. It is also possible that the reviewers did not understand why your research problem is interesting or challenging. In this case, it may also be a problem with the presentation. Your introduction should convince the reader that your research problem is important and challenging.

Second, another complaint that I often hear is that the reviewer did not understand something important about the technical details of your paper.  Some reasons may be:

  • It may be an issue with the presentation. Even if you are right that all the details were correctly presented in your paper, it is possible that the reviewer got bored reading the paper because of a poor presentation or a lack of examples. Don’t forget that a very busy reviewer will not spend days reading your paper. Often a reviewer may just have a few hours to read it. In this case, rethinking the presentation of your paper to make it easier to read, or clearer with respect to what the reviewer did not understand, is a good idea.
  • Another problem may be that the reviewer is not an expert in your field and may have some misconceptions about it because he has not read much about it. For example, recently, a paper about itemset mining got rejected and the reviewer said something like “oh, this is just like the algorithm X from 20 years ago”. Well, this shows that the reviewer has not followed that field for a long time. To avoid this kind of bad review, a solution is to add some text that addresses the common misconceptions that a reviewer who is not a specialist in your field may have. For example, recently, I was writing a paper about Itemset-Trees, and I added a few lines to make it clear that this kind of tree is not the same as an FP-Tree. Although the two structures are very different, many non-specialists confuse them because they usually only know the FP-Tree.

There are also some more serious reasons why a paper may be rejected. It may be that your paper is technically flawed, that your experiments are not convincing, that the data or results do not look good or original, that your method is not well explained or not compared with other relevant methods, that the paper is very badly written, etc. In these cases, the problem is more critical, and it is better to take the time to seriously improve your paper instead of resubmitting it right away.

In any case, if your paper is rejected, you have probably already spent a great deal of time on it, and therefore it is generally a good idea to improve it and submit it somewhere else.

Lastly, I will tell you a short story about one of my papers to give you hope if your paper got rejected. A few years ago, I submitted a paper to the conference Intelligent Tutoring Systems (ITS). It got rejected with bad reviews. Later, I submitted almost the same paper to EC-TEL, a good e-learning conference with an acceptance rate of about 1/5. The paper got accepted, it was even invited for a special issue of a good IEEE Transactions journal, and it was rated as one of the top 10 papers of the EC-TEL conference that year. So this is to tell you that sometimes it is possible to be unlucky, and also that the choice of the conference may have a huge impact on whether your paper gets accepted or rejected. In my case, the same paper got rejected at ITS and was reviewed as one of the best papers at EC-TEL, just by choosing a different conference.

So this is the advice that I wanted to write for today. I hope that you have enjoyed this post.

How to answer reviewers for a journal paper revision?

Today, I will discuss how to prepare the revision of a journal paper after receiving the editor’s decision that the paper will be accepted with minor or major modifications.


As most people should know, when submitting a paper to a journal, the decision is usually “accept”, “accept with minor modifications”, “accept with major modifications”, “reject” or “resubmit as new”. In the second and third cases, the authors have to revise their paper according to the reviewers’ comments and write a file, generally called “summary of changes”, explaining how they have addressed the reviewers’ comments.

So let’s look at how to write a file “Summary of changes” to answer reviewers:

  • You should first say thank you to the reviewers for the useful comments that they have made to improve the paper.
  • Then, in the same file, you should explain how you have answered each reviewer comment that requires an action.
  • To do that, I suggest organizing your document as follows. Create a section for each reviewer. Then, in each section, copy each comment made by the reviewer as a quote (“…”), and explain below the quote how you have addressed the comment. For example, your document could look like this.


        First, we would like to say thank you to the reviewers for the useful comments to improve the paper. We have addressed all the comments as explained below.

        REVIEWER 1

                     “Section 3 is too long”

               We have addressed this comment by deleting a few lines at the end of the second paragraph that were not necessary for understanding the algorithm.

         REVIEWER 2

                Reviewer 2 has reported several typos and grammatical errors.  We have fixed all of them and proofread the paper to eliminate all such errors.

  • But how should you answer a reviewer’s comment?
  • First, if you agree with the reviewer, you should do exactly what the reviewer asks you to do, and mention that you have done it. Then the reviewer should be happy. Second, you can disagree with the reviewer. In this case, you do not have to make the change, but you need to explain well why you disagree. Third, it is possible that the reviewer has made a comment that is inaccurate or that you have already addressed in your paper (but the reviewer did not see it). In this case, you also need to explain that. So overall, it is important to answer all comments.
  • From my experience, usually 2 or 3 reviewers are assigned to review each journal paper. In top journals, the reviewers may be experts on your topic. In journals that are not top journals, reviewers may not be very familiar with your topic but may still be good researchers.
  • Usually, if your paper receives the decision “accept with minor modifications”, there is a high chance that your paper will be accepted if you address the comments well. If the decision is “accept with major modifications”, there is a risk that your paper may not be accepted if you don’t address the comments well, so you may need to work harder to convince the reviewers.
  • Usually, there are one, two or three rounds of reviews. Generally, after the first revision, most comments have been addressed. Therefore, the job becomes easier after the first revision. Usually, the editor wants the reviews to converge to a decision after about two rounds of reviews (the editor will likely intervene if reviewers always ask for more things to do).

So this is the advice that I wanted to share today. I hope that you have enjoyed this post.  If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.


Top mistakes when writing a research paper

Today, I will discuss some common mistakes that I have seen in research papers and that every author should avoid!


  • Overlength papers.  When a paper is submitted to a conference or journal, there is generally a page limit. If the page limit is not respected, several reviewers will not like it. The reason is that reviewers are generally very busy and they have to review many papers. Reviewers should not have to spend more time reading a paper because someone did not want to spend time to make it fit within the page limit.
  • Proofreading and unreadable papers.  A paper should be well-written and the author(s) should proofread it before submitting it.  I have seen on a few occasions some unreadable papers that looked like they were automatically translated by Google.  This is a guaranteed reject.  Besides, if there are many typos because the authors did not take the time to proofread their paper, it may bother the reviewers.
  • Paper that contains plagiarized content.  A paper that contains text copied from another paper reduces your chance of being accepted, more or less depending on the amount of text that is copied. All the text in your papers should be written by yourself only. It is easy for a reviewer to detect plagiarized content using the internet.
  • Incremental extension of the author’s previous work. I have recently read a paper where the author extended his own work, published just a few months ago. The problem with that paper was that the author just made a few minor changes before submitting it as a new paper. A new paper on the same topic should present at least 30 to 40% new content and there should be a significant difference with the previous work.
  • Not citing recent work or works in top conferences, journals. Several reviewers check the dates of the references when evaluating a paper. For example, I have read a paper recently where all references were from before 2006.  This is a bad sign, since it is unlikely that nothing has happened in a given field since 2006.  When writing a paper, it is recommended to add a few newer references to show that you are aware of the newest research.  It also looks better if you cite articles from top conferences and journals.
  • Irrelevant information. Some papers contain irrelevant information or information that is not really important.  For example, if your paper is a data mining paper submitted to a data mining or artificial intelligence conference, it is not necessary to explain what data mining is. It can be assumed that the reviewers, who are specialists in their field, know what “data mining” is. Another example is to mention irrelevant details such as why a given software was used to make charts.
  • Not comparing your proposal with the state-of-the-art solutions but instead with old solutions.  For example, if you propose a new data mining algorithm, it is expected that you will compare the performance with the current best algorithm to demonstrate that your algorithm is better at least in some cases.  A common error is to compare with some old algorithms that are not the best anymore.
  • Not citing the sources correctly, making factual errors or missing some important references.  Another mistake that I have seen many times is authors who claim incorrect facts about previous work or ignore important previous work in their paper.  For example, I have recently read a paper that said that to do X, there are only two types of methods: A and B.  However, from my knowledge of this field, I know that there are three main approaches: A, B and C.  Another example is a paper where the author proposes to define a new problem and proposes a new algorithm for that problem. However, the author fails to recognize that his problem can be solved using some existing methods published several years ago, and does not cite this work either.
  • Not showing that the problem that the author wants to solve is important and challenging.  It is important in a research paper to show that the problem that you want to solve is important and challenging. If the reviewer is not convinced in the introduction that your problem is important or challenging, it is a bad start.  To convince the reviewer, explain why the problem is important, what the limitations of current solutions are, and why the problem is difficult.
  • Poor organization / Paragraphs should flow naturally. It is important that the various parts of a research paper are connected by a “flow”. What I mean is that when the reviewer is reading your paper, each section or paragraph should feel logically connected to the previous and next paragraphs.
  • Figures/charts that do not look good or are too small.  About charts, it is important to make them look good. I have made a blog post about “How to make good looking charts for research papers?”  Besides, a second mistake is to make the charts or figures so small that they become unreadable. If the reviewer prints your paper to read it, he should be able to read the text without using a magnifying glass. Moreover, it should not be expected that the reviewer will read the PDF version of your paper and can zoom in.
  • Figures that are irrelevant. In some papers, authors put a lot of figures that are irrelevant. For example, if a figure can be summarized with one or two lines of text, it is better to remove it.
  • Not justifying the design decisions for the proposed solution. If the solution proposed in a research paper looks like a naïve solution, it can be bad. When a reviewer reads your paper, the paper should convince him that the solution is a good solution.  What I mean is that it is not enough to explain your solution. You should also explain why it is a good solution. For example, for a data mining algorithm, you may introduce some optimizations.  To show that the optimizations are good optimizations, a good way is to compare your algorithm with and without the optimizations to show that the optimizations improve the performance.
  • Respecting the paper format. In general, it is important to respect the paper format, even if it is not the final version of the paper. In particular, I have seen several authors not respect the format for the references.  But it is important if you want to give a good impression.
  • Mention your contributions. Although it is not mandatory, I recommend stating the contributions of the paper in your introduction, and mentioning them again in the conclusion. If you don’t mention them explicitly, then the reviewer will have to guess what they are.
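
The advice above about justifying design decisions (comparing an algorithm with and without its optimizations) can be sketched as follows. This is only a toy illustration in Python, not a real benchmark: the function frequent_pairs, its prune flag and the small list of transactions are hypothetical names and data made up for this example. The idea is to run the same algorithm with the optimization switched on and off, check that both runs return the same result, and report how much work each run did.

```python
from itertools import combinations


def frequent_pairs(transactions, minsup, prune=True):
    """Toy miner (hypothetical): return (frequent item pairs, number of candidate pairs counted)."""
    # Count the support of each single item.
    item_support = {}
    for t in transactions:
        for item in set(t):
            item_support[item] = item_support.get(item, 0) + 1

    items = sorted(item_support)
    if prune:
        # Optimization under test: a pair can only be frequent
        # if both of its items are frequent on their own.
        items = [i for i in items if item_support[i] >= minsup]

    candidates = list(combinations(items, 2))
    result = {}
    for a, b in candidates:
        support = sum(1 for t in transactions if a in t and b in t)
        if support >= minsup:
            result[(a, b)] = support
    return result, len(candidates)


# Made-up data, only for illustration.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"b", "e"}]

with_opt, n_with = frequent_pairs(transactions, minsup=2, prune=True)
without_opt, n_without = frequent_pairs(transactions, minsup=2, prune=False)

# The optimization must not change the output, only the amount of work.
assert with_opt == without_opt
print("candidates counted with pruning:   ", n_with)
print("candidates counted without pruning:", n_without)
```

In a real paper, the same with/without comparison would be reported on real datasets, using execution time or memory rather than a simple candidate count.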

Obviously, I could write more about this because there are many mistakes that one can make in a research paper. But here, I have just tried to show the most common mistakes. I hope that you have enjoyed this post.  If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.


Why data mining researchers should evaluate their algorithms against state-of-the-art algorithms?

A common problem in research on data mining is that researchers proposing new data mining algorithms often do not compare the performance of their new algorithm with the current state-of-the-art data mining algorithms.

Let me illustrate this problem with an example.  Recently, I did a literature review on the topic of frequent subgraph mining. I searched for all recent papers on this topic and also older papers.  Then, based on this search, I drew the following diagram, where an arrow X –>Y from an algorithm X to an algorithm Y indicates that X was shown to be a better algorithm than Y in terms of execution time by the authors of X in an experiment.

What is the problem?   The problem is that several algorithms have not been compared with each other.  Therefore, if someone wants to know which algorithm is the best, it is impossible to know the answer without implementing the algorithms again and performing the comparison (because many authors have not disclosed their source code or binary files publicly).  For example, as we can see in the above diagram,

  • GSPAN (2002) was shown to be faster than FSG (2001), the initial algorithm for this problem.
  • Then FFSM (2003) was shown to be faster than GSPAN (2002).
  • Then, GASTON (2004) was shown to be faster than FSG (2001) and GSPAN (2002).  But it was not compared with FFSM (2003). It is therefore still unknown if GASTON is faster than FFSM (2003).
  • Thereafter, SLAGM (2007) was proposed and was not compared with any of the previous algorithms.
  • Then, FSMA (2008) was proposed and was only compared with SLAGM (2007), which has still not been compared with any of the previous algorithms.
  • Then, FPGraphMiner (2011) was proposed.  It was compared with GSPAN (2002). But it was not compared with GASTON (2004), which was shown to be faster than GSPAN (2002). Moreover, it was not compared with FFSM (2003).

Note that in the above diagram, the boxes in green represent algorithms for closed subgraph mining (ClosedGraphMiner (2003)) and maximal subgraph mining (SPIN (2004) and MARGIN (2006)).  For these algorithms:

  • ClosedGraphMiner (2003) was compared with GSPAN (2002), which is fine since it was the best algorithm at that time.
  • SPIN (2004) was compared with FFSM (2003) and GSPAN (2002), which is fine since they were the two best algorithms at that time.
  • However, MARGIN (2006) was not compared with SPIN (2004). It was only compared with FFSM (2003). Moreover, it was not compared with GASTON (2004).

As you can see from this example, the topic of subgraph mining is a little bit of a “mess” if one wants to know which algorithm is the best.  It could be tempting to use the above diagram to make transitive inferences. For example, we could use the links FFSM <– MARGIN <– FPGraphMiner to infer that FPGraphMiner should be faster than FFSM. However, drawing such inferences is risky, given that the results of a performance comparison often vary depending on the choice of datasets and so on.
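
To make the gaps concrete, here is a small sketch in Python that encodes only the frequent subgraph mining comparisons listed in the bullets above (the closed and maximal miners are left out) and then lists the pairs of algorithms that were never compared directly. The variable names and the helper inferred_faster are my own choices for this illustration. For FSMA and FPGraphMiner, the bullets only say that a comparison was made, so the sketch assumes no direction for these two pairs.

```python
from itertools import combinations

# Algorithms and years, as listed in the bullets above.
algorithms = {
    "FSG": 2001, "GSPAN": 2002, "FFSM": 2003, "GASTON": 2004,
    "SLAGM": 2007, "FSMA": 2008, "FPGraphMiner": 2011,
}

# (X, Y) means "X was shown to be faster than Y" according to the bullets.
faster_than = {
    ("GSPAN", "FSG"), ("FFSM", "GSPAN"),
    ("GASTON", "FSG"), ("GASTON", "GSPAN"),
}

# Pairs that were compared, but whose outcome is not stated in the bullets.
compared_only = {("FSMA", "SLAGM"), ("FPGraphMiner", "GSPAN")}

# Every pair with at least one direct experimental comparison.
compared = {frozenset(pair) for pair in faster_than | compared_only}

never_compared = [
    pair for pair in combinations(sorted(algorithms), 2)
    if frozenset(pair) not in compared
]


def inferred_faster(start):
    """Algorithms reachable from `start` by following "faster than" arrows."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for x, y in faster_than:
            if x == current and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


# Claims that rest only on transitivity, with no direct experiment behind them.
transitive_only = sorted(
    (x, y) for x in algorithms for y in inferred_faster(x)
    if (x, y) not in faster_than
)

print(len(never_compared), "pairs were never compared directly")
print("Supported only by transitive inference:", transitive_only)
```

Any pair reported in transitive_only should be treated as a guess rather than a result, for the reason given above: the outcome of a comparison can change with the datasets and the implementation.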

So what is the solution to this problem?   Personally, I think that the solution is that researchers should make their source code or binary files public, as well as their datasets. This would facilitate the comparison of algorithm performance.  However, it would not solve all problems since, for example, different researchers may use different programming languages.    Another part of the solution would be for researchers to make the extra effort to implement more competitor algorithms when doing performance comparisons, and to do their best to choose the most recent algorithms.

This is all of what I wanted to write for today! Hope that you have enjoyed this post.  If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.



Brief report about the ADMA 2013 conference

In this blog post, I will discuss my recent trip to the ADMA 2013 conference (9th International Conference on Advanced Data Mining and Applications), held on December 14-16, 2013 at Zhejiang University in Hangzhou, China. Note that the views expressed in this post are my personal opinion about what I attended at this conference.


First, let me say that I enjoyed this conference. ADMA is an international data mining conference that is always located in China but in a different city every year. The idea to always host ADMA in China is inspired by the SIAM DM conference, which is always hosted in the USA. The proceedings of the ADMA conference are published in the Springer LNAI series.  The acceptance rate for papers with a long presentation was quite competitive this year (about 14%), while another 28% of submissions were accepted as papers with a short presentation.   The ADMA conference focuses on data mining techniques but also on data mining applications, as its name implies.

Panel on Big Data

Since Big Data is a popular topic, there was a panel on big data including several top researchers such as Frans Coenen, Osmar Zaiane, Xindong Wu, etc.  The panel presented some interesting opinions about Big Data.

  • It was mentioned that although big data is a popular topic, it is usually hard to get public big data datasets on the Internet.
  • A panelist stated that big data is a fluke and that actually, most challenges raised by big data were already known by the data mining community before all that hype. The same panelist mentioned that we should take advantage of this hype in the data mining community to create larger projects and apply them in the industry, get funding, etc.  However, the data mining community should be cautious not to make unachievable promises to the public, or it could perhaps result in something like the “AI winter” for artificial intelligence, but for data mining.
  • Another panelist made an interesting comparison between “big data” and sex for teenagers, stating that “everybody talks about it, everybody says they do it, but few people really do it” or something like that, if I remember well.
  • There was also some discussion about what big data is. For example, related to the panel, there was a keynote by Xindong Wu. He presented a characterization of what “Big data is” that he calls the “HACE theorem”, although it is not a theorem since there is no proof.  The main idea is that according to him big data should have the following characteristics: Heterogeneous, Autonomous (decentralized and distributed), Complex and Evolving.
  • Some challenges of big data were mentioned such as: local learning and model fusion, mining sparse, uncertain and incomplete data, mining complex and dynamic data, mining big data streams and feature streams, and the problem of dimensionality.

Paper presentations

There were several interesting paper presentations. The best paper award went to a paper about data mining applied to a water network.  I did not have much chance to choose the sessions that I wanted to attend because I had four paper presentations (one done by my student) and I was also a session chair.

The main topics this year were:

  • sequential data mining
  • opinion mining
  • behavior mining
  • stream mining
  • web mining
  • image mining
  • text mining
  • social network mining
  • classification
  • clustering
  • pattern mining
  • regression, prediction, feature extraction
  • machine learning
  • applications

Other aspects of the conference

Overall, the conference was very enjoyable. We also had an excursion to the famous West Lake in Hangzhou. But unfortunately, it was raining a lot, and we even had a little bit of snow!

Next year’s conference

An important announcement is that next year’s conference, ADMA 2014, will be located in Guilin, China. It will be the 10th anniversary of the conference.

