The Data Blog

Top mistakes when writing a research paper

Posted on 2014-01-05 by Philippe Fournier-Viger

Today, I will discuss some common mistakes that I have seen in research papers and that every author should avoid!

Overlength papers. When a paper is submitted to a conference or journal, there is generally a page limit. If the page limit is not respected, several reviewers will not like it. The reason is that reviewers are generally very busy and have many papers to review. Reviewers should not have to spend more time reading a paper because someone did not take the time to make it fit within the page limit.
Proofreading and unreadable papers. A paper should be well written, and the author(s) should proofread it before submitting it. On a few occasions, I have seen unreadable papers that looked like they had been automatically translated by Google. This is a guaranteed reject. Besides, if there are many typos because the authors did not take the time to proofread their paper, it may bother the reviewers.
Paper that contains plagiarized content. Text copied from another paper reduces your chance of being accepted, by more or less depending on the amount of text that is copied. All the text in your papers should be written by yourself only. It is easy for a reviewer to detect plagiarized content using the internet.
Incremental extension of the author’s previous work. I have recently read a paper where the author extended his own work, published just a few months earlier. The problem with that paper was that the author just made a few minor changes before submitting it as a new paper. A new paper on the same topic should present at least 30 to 40% new content, and there should be a significant difference with the previous work.
Not citing recent work or works in top conferences and journals. Several reviewers check the dates of the references when evaluating a paper. For example, I recently read a paper where all references were from before 2006. This is a bad sign, since it is unlikely that nothing has happened in a given field since 2006. When writing a paper, it is recommended to add a few newer references to show that you are aware of the newest research. It also looks better if you cite articles from top conferences and journals.
Irrelevant information. Some papers contain irrelevant information or information that is not really important. For example, if your paper is a data mining paper submitted to a data mining or artificial intelligence conference, it is not necessary to explain what data mining is. It can be assumed that the reviewers, who are specialists in their field, know what “data mining” is. Another example is mentioning irrelevant details such as why a given software was used to make charts.
Not comparing your proposal with the state-of-the-art solutions but instead with old solutions. For example, if you propose a new data mining algorithm, it is expected that you will compare the performance with the current best algorithm to demonstrate that your algorithm is better at least in some cases. A common error is to compare with some old algorithms that are not the best anymore.
Not citing sources correctly, making factual errors or missing some important references. Another mistake that I have seen many times is authors who make incorrect claims about previous work or ignore important previous work in their paper. For example, I recently read a paper that said that to do X, there are only two types of methods: A and B. However, from my knowledge of this field, I know that there are three main approaches: A, B and C. Another example is a paper where the author proposes to define a new problem and proposes a new algorithm for that problem. However, the author fails to recognize that his problem can be solved using some existing methods published several years ago, and does not cite this work either.
Not showing that the problem that the author wants to solve is important and challenging. It is important in a research paper to show that the problem that you want to solve is important and challenging. If the reviewer is not convinced in the introduction that your problem is important or challenging, it is a bad start. To convince the reviewer, explain why the problem is important, what the limitations of current solutions are, and why the problem is difficult.
Poor organization / Paragraphs should flow naturally. It is important that the various parts of a research paper are connected by a “flow”. What I mean is that when the reviewer is reading your paper, each section or paragraph should feel logically connected to the previous and next paragraphs.
Figures/charts that do not look good or are too small. About charts, it is important to make them look good. I have made a blog post about “How to make good looking charts for research papers?” Besides, a second mistake is to make the charts or figures so small that they become unreadable. If the reviewer prints your paper to read it, he should be able to read the text without using a magnifying glass. Moreover, it should not be expected that the reviewer will read the PDF version of your paper and can zoom in.
Figures that are irrelevant. In some papers, authors put a lot of figures that are irrelevant. For example, if a figure can be summarized with one or two lines of text, it is better to remove it.
Not justifying the design decisions for the proposed solution. If the solution proposed in a research paper looks like a naïve solution, it can be bad. When a reviewer reads your paper, the paper should convince him that the solution is a good solution. What I mean is that it is not enough to explain your solution. You should also explain why it is a good solution. For example, for a data mining algorithm, you may introduce some optimizations. A good way to show that the optimizations are worthwhile is to compare your algorithm with and without them, to show that they improve the performance.
Respecting the paper format. In general, it is important to respect the paper format, even if it is not the final version of the paper. In particular, I have seen several authors not respect the format for the references. But it is important if you want to give a good impression.
Mention your contributions. Although it is not mandatory, I recommend stating the contributions of the paper in your introduction, and mentioning them again in the conclusion. If you don’t mention them explicitly, then the reviewer will have to guess what they are.

Obviously, I could write more about this because there are many mistakes that one can make in a research paper. But here, I have just tried to show the most common mistakes. Hope that you have enjoyed this post. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.

Why should data mining researchers evaluate their algorithms against state-of-the-art algorithms?

Posted on 2014-01-01 by Philippe Fournier-Viger

A common problem in research on data mining is that researchers proposing new data mining algorithms often do not compare the performance of their new algorithm with the current state-of-the-art data mining algorithms.

Let me illustrate this problem with an example. Recently, I did a literature review on the topic of frequent subgraph mining. I searched for all recent papers on this topic and also older papers. Then, based on this search, I drew the following diagram, where an arrow X –> Y from an algorithm X to an algorithm Y indicates that X was shown to be a better algorithm than Y in terms of execution time by the authors of X in an experiment.

What is the problem? The problem is that several algorithms have not been compared with each other. Therefore, if someone wants to know which algorithm is the best, it is impossible to know the answer without implementing the algorithms again and performing the comparison (because many authors have not disclosed their source code or binary files publicly). For example, as we can see in the above diagram,

GSPAN (2002) was shown to be faster than FSG (2001), the initial algorithm for this problem.
Then FFSM (2003) was shown to be faster than GSPAN (2002).
Then, GASTON (2004) was shown to be faster than FSG (2001) and GSPAN (2002). But it was not compared with FFSM (2003). It is therefore still unknown if GASTON is faster than FFSM (2003).
Thereafter, SLAGM (2007) was proposed and was not compared with any of the previous algorithms.
Then, FSMA (2008) was proposed and was only compared with SLAGM (2007), which has still not been compared with any of the previous algorithms.
Then, FPGraphMiner (2011) was proposed. It was compared with GSPAN (2002). But it was not compared with GASTON (2004), which was shown to be faster than GSPAN (2002). Moreover, it was not compared with FFSM (2003).

Note that on the above diagram, the green boxes represent algorithms for closed subgraph mining (ClosedGraphMiner (2003)) and maximal subgraph mining (SPIN (2004) and MARGIN (2006)). For these algorithms:

ClosedGraphMiner (2003) was compared with GSPAN (2002), which is fine since it was the best algorithm at that time.
SPIN (2004) was compared with FFSM (2003) and GSPAN (2002), which is fine since they were the two best algorithms at that time.
However, MARGIN (2006) was not compared with SPIN (2004). It was only compared with FFSM (2003). Moreover, it was not compared with GASTON (2004).

As you can see from this example, the topic of subgraph mining is a little bit of a “mess” if one wants to know which algorithm is the best. It could be tempting to use the above diagram to make transitive inferences. For example, we could use the links FFSM <– MARGIN <– FPGraphMiner to infer that FPGraphMiner should be faster than FFSM. However, drawing such inferences is risky, given that the outcome of a performance comparison often varies depending on the choice of datasets and so on.
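
To make the gaps concrete, the direct comparisons listed earlier for the frequent subgraph mining algorithms can be tabulated and the missing pairs counted. The following is only a minimal Python sketch: the pair list is transcribed from the bullet points above (closed and maximal miners are left out, and the direction of each result is ignored), and the names `compared` and `uncompared_pairs` are hypothetical, chosen for illustration.

```python
from itertools import combinations

# Pairs of algorithms that were directly compared in an experiment,
# transcribed from the list in this post (result direction is ignored).
compared = {
    frozenset(pair)
    for pair in [
        ("GSPAN", "FSG"),
        ("FFSM", "GSPAN"),
        ("GASTON", "FSG"),
        ("GASTON", "GSPAN"),
        ("FSMA", "SLAGM"),
        ("FPGraphMiner", "GSPAN"),
    ]
}


def uncompared_pairs(compared):
    """Return the sorted list of algorithm pairs with no direct comparison."""
    algorithms = sorted(set().union(*compared))
    return [
        (a, b)
        for a, b in combinations(algorithms, 2)
        if frozenset((a, b)) not in compared
    ]


missing = uncompared_pairs(compared)
print(len(missing), "pairs were never compared directly")
for a, b in missing:
    print(a, "vs", b)
```

With seven algorithms there are 21 possible pairs, so six direct comparisons leave most pairs untested, which is why the question of the fastest algorithm stays open.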

So what is the solution to this problem? Personally, I think that researchers should make their source code or binary files public, as well as their datasets. This would facilitate the comparison of algorithm performance. However, it would not solve all problems since, for example, different researchers may use different programming languages. Another part of the solution would be for researchers to make the extra effort to implement more competitor algorithms when doing performance comparisons, and to do their best to choose the most recent algorithms.

This is all I wanted to write for today! Hope that you have enjoyed this post. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Brief report about the ADMA 2013 conference

Posted on 2013-12-25 by Philippe Fournier-Viger

In this blog post, I will discuss my recent trip to the ADMA 2013 conference (9th Intern. Conf. on Advanced Data Mining and Applications), held on December 14-16, 2013 in Hangzhou, China, at Zhejiang University. Note that the views expressed in this post are my personal opinion about what I attended at this conference.

First, let me say that I enjoyed this conference. ADMA is an international data mining conference that is always located in China but in a different city every year. The idea to always host ADMA in China is inspired by the SIAM DM conference, which is always hosted in the USA. The proceedings of the ADMA conference are published in the Springer LNAI series. The acceptance rate for papers with a long presentation was quite competitive this year (about 14%), while another 28% of submissions were accepted as papers with a short presentation. The ADMA conference focuses on data mining techniques but also on data mining applications, as its name implies.

Panel on Big Data

Since Big Data is a popular topic, there was a panel on big data including several top researchers such as Frans Coenen, Osmar Zaiane, Xindong Wu, etc. The panel presented some interesting opinions about Big Data.

It was mentioned that although big data is a popular topic, it is usually hard to get public big data datasets on the Internet.
A panelist stated that big data is a fluke and that, actually, most challenges raised by big data were already known by the data mining community before all that hype. The same panelist mentioned that we should take advantage of this hype in the data mining community to create larger projects and apply them in the industry, get funding, etc. However, the data mining community should be cautious not to make unachievable promises to the public, or it could perhaps result in something like the “AI winter” for artificial intelligence, but for data mining.
Another panelist made an interesting comparison between “big data” and sex for teenagers, stating that “everybody talks about it, everybody says they do it, but few people really do it” or something like that, if I remember well.
There was also some discussion about what big data is. For example, related to the panel, there was a keynote by Xindong Wu. He presented a characterization of what big data is that he calls the “HACE theorem”, although it is not a theorem since there is no proof. The main idea is that, according to him, big data should have the following characteristics: Heterogeneous, Autonomous (decentralized and distributed), Complex and Evolving.
Some challenges of big data were mentioned such as: local learning and model fusion, mining sparse, uncertain and incomplete data, mining complex and dynamic data, mining big data stream and feature stream, and the problem of dimensionality.

Paper presentations

There were several interesting paper presentations. The best paper award went to a paper about data mining applied to a water network. I did not have much chance to choose the sessions that I wanted to attend because I had four paper presentations (one done by my student) and I was also a session chair.

The main topics this year were:

sequential data mining
opinion mining
behavior mining
stream mining
web mining
image mining
text mining
social network mining
classification
clustering
pattern mining
regression, prediction, feature extraction
machine learning
applications

Other aspects of the conference

Overall, the conference was very enjoyable. We also had an excursion to the famous West Lake in Hangzhou. But unfortunately, it was raining a lot, and we even had a little bit of snow!

Next year's conference

An important announcement is that next year's conference, ADMA 2014, will be located in Guilin, China. It will be the 10th anniversary of the conference.

Do conference reviewers procrastinate?

Posted on 2013-11-22 by Philippe Fournier-Viger

Today, I will write about the work of reviewers for scientific conferences. As you probably know, when a researcher submits a paper to a conference, the paper is usually assigned to three or more reviewers. The program chair of the conference usually gives the reviewers one or two months to perform the reviews.

Recently, I had access to data about the number of reviews submitted by reviewers per day for a computer science conference. I have put this data on a chart that you can see below:

[Figure: reviewing trend (number of reviews submitted per day)]

It is interesting to see that reviewers seem to procrastinate. As can be seen on the chart, during a reviewing period of one month, most reviews were submitted during the last few days, just before the deadline. This raises the question of how much time a reviewer really needs to perform a review. If a typical reviewer were given two or three months instead of one month, would s/he also wait to submit the reviews during the last few days?

I have to say that reviewers are usually very busy and have to review several papers. A reason for late reviews may be that reviewers prefer to submit all of their reviews at the same time or are just too busy!

I just wanted to share this interesting observation. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

An introduction to frequent pattern mining

Posted on 2013-10-13 by Philippe Fournier-Viger

In this blog post, I will give a brief overview of an important subfield of data mining that is called pattern mining. Pattern mining consists of using/developing data mining algorithms to discover interesting, unexpected and useful patterns in databases.

Pattern mining algorithms can be applied on various types of data such as transaction databases, sequence databases, streams, strings, spatial data, graphs, etc.

Pattern mining algorithms can be designed to discover various types of patterns: subgraphs, associations, indirect associations, trends, periodic patterns, sequential rules, lattices, sequential patterns, high-utility patterns, etc.

But what is an interesting pattern? There are several definitions. For example, some researchers define an interesting pattern as a pattern that appears frequently in a database. Other researchers want to discover rare patterns, patterns with a high confidence, the top patterns, etc.

In the following, I will give examples of two types of patterns that can be discovered from a database.

Example 1. Discovering frequent itemsets

The most popular algorithm for pattern mining is without a doubt Apriori (1994). It is designed to be applied to a transaction database to discover patterns in transactions made by customers in stores. But it can also be used in several other applications. A transaction is defined as a set of distinct items (symbols). Apriori takes as input (1) a minsup threshold set by the user and (2) a transaction database containing a set of transactions. Apriori outputs all frequent itemsets, i.e. groups of items shared by no less than minsup transactions in the input database. For example, consider the following transaction database containing four transactions. Given a minsup of two transactions, the frequent itemsets are “bread, butter”, “bread, milk”, “bread”, “milk” and “butter”.

T1: bread, butter, spinach
T2: butter, salmon
T3: bread, milk, butter
T4: cereal, bread, milk

Fig. 1. A transaction database
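
The frequent itemsets of this small database can be checked with a few lines of code. The following is only a minimal Python sketch of a simplified level-wise search in the spirit of Apriori, not an optimized or faithful implementation of the original algorithm. The function names `support` and `frequent_itemsets` are hypothetical, and minsup is expressed as a number of transactions, as in the example above.

```python
from itertools import combinations

# The four transactions of Fig. 1.
transactions = [
    {"bread", "butter", "spinach"},  # T1
    {"butter", "salmon"},            # T2
    {"bread", "milk", "butter"},     # T3
    {"cereal", "bread", "milk"},     # T4
]


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, minsup):
    """Level-wise search: frequent itemsets of size k build candidates of size k + 1."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset, transactions)
        # Join step: merge pairs of frequent k-itemsets into (k + 1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        previous = set(level)
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, k))
                 and support(c, transactions) >= minsup]
        k += 1
    return result


result = frequent_itemsets(transactions, minsup=2)
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

The prune step is what makes the search level-wise: “bread, butter, milk” is discarded without counting it, because its subset “butter, milk” appears in only one transaction.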

Apriori can also apply a post-processing step to generate “association rules” from frequent itemsets, which I will not discuss here.

The Apriori algorithm has given rise to multiple algorithms that address the same problem or variations of this problem, such as (1) incrementally discovering frequent itemsets and associations, (2) discovering frequent subgraphs in a set of graphs, (3) discovering subsequences common to several sequences, etc.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

To know more about this topic, two papers give a good overview of itemset mining techniques:

Fournier-Viger, P., Lin, J. C.-W., Vo, B, Chi, T.T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, Wiley, e1207 doi: 10.1002/widm.1207, 18 pages.
Luna, J. M., Fournier-Viger, P., Ventura, S. (2019). Frequent Itemset Mining: a 25 Years Review. WIREs Data Mining and Knowledge Discovery, Wiley, 9(6):e1329. DOI: 10.1002/widm.1329

Example 2. Discovering sequential rules

The second example that I will give is the discovery of sequential rules in a sequence database. A sequence database is defined as a set of sequences. A sequence is a list of transactions (as previously defined). For example, in the left part of the following figure, a sequence database containing four sequences is shown. The first sequence contains items a and b, followed by c, followed by f, followed by g, followed by e. A sequential rule has the form X –> Y where X and Y are two disjoint, non-empty sets of items. The meaning of a rule is that if the items of X appear in a sequence in any order, they will be followed by the items of Y in any order. The support of a rule is the number of sequences containing the rule divided by the total number of sequences. The confidence of a rule is the number of sequences containing the rule divided by the number of sequences containing its antecedent. The goal of sequential rule mining is to discover all sequential rules having a support and confidence no less than two thresholds given by the user, named “minsup” and “minconf”. For example, on the right part of the following figure, some sequential rules are shown for minsup=0.5 and minconf=0.5, discovered by the RuleGrowth algorithm.
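
The definitions of support and confidence above can be sketched in a few lines of Python. This is only an illustration of the two measures for one given rule, not the RuleGrowth algorithm, which searches for all valid rules. The first sequence is the one described in the text; the three other sequences are hypothetical and are not claimed to match the figure. The function names are also hypothetical.

```python
# A sequence is a list of itemsets (sets of items).
database = [
    [{"a", "b"}, {"c"}, {"f"}, {"g"}, {"e"}],    # first sequence, as described above
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],  # hypothetical
    [{"a"}, {"b"}, {"f"}, {"e"}],                # hypothetical
    [{"b"}, {"f", "g"}],                         # hypothetical
]


def first_position(sequence, item):
    """Index of the first itemset containing item, or None if absent."""
    return next((i for i, s in enumerate(sequence) if item in s), None)


def last_position(sequence, item):
    """Index of the last itemset containing item, or None if absent."""
    hits = [i for i, s in enumerate(sequence) if item in s]
    return hits[-1] if hits else None


def contains_rule(sequence, x, y):
    """True if all items of x appear, in any order, before all items of y."""
    firsts = [first_position(sequence, item) for item in x]
    lasts = [last_position(sequence, item) for item in y]
    if None in firsts or None in lasts:
        return False
    # x is complete at max(firsts); y must fit entirely after that point.
    return max(firsts) < min(lasts)


def support_and_confidence(database, x, y):
    """Return (support, confidence) of the rule x -> y as fractions."""
    with_rule = sum(1 for s in database if contains_rule(s, x, y))
    with_antecedent = sum(
        1 for s in database
        if all(first_position(s, item) is not None for item in x)
    )
    support = with_rule / len(database)
    confidence = with_rule / with_antecedent if with_antecedent else 0.0
    return support, confidence


print(support_and_confidence(database, {"a", "b"}, {"e"}))
print(support_and_confidence(database, {"b"}, {"e"}))
```

On this toy database, the rule {b} –> {e} is contained in three of the four sequences, while its antecedent b appears in all four, so its confidence is lower than that of {a, b} –> {e}.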

For more details about sequential rule mining, this paper presents the RuleGrowth algorithm (I’m the author of that paper).

Fournier-Viger, P., Nkambou, R. & Tseng, V. S. (2011). RuleGrowth: Mining Sequential Rules Common to Several Sequences by Pattern-Growth. Proceedings of the 26th Symposium on Applied Computing (ACM SAC 2011). ACM Press, pp. 954-959.

Pattern mining is often viewed as techniques to explain the past by discovering patterns. However, patterns found can also be used for prediction. As an example of application, the following paper shows how sequential rules can be used for predicting the next webpages that will be visited by users on a website, with a higher accuracy than using sequential patterns (named “classic sequential rules” in that paper):

Fournier-Viger, P. Gueniche, T., Tseng, V.S. (2012). Using Partially-Ordered Sequential Rules to Generate More Accurate Sequence Prediction. Proc. 8th International Conference on Advanced Data Mining and Applications (ADMA 2012), Springer LNAI 7713, pp. 431-442.

Implementations of pattern mining algorithms

If you want to learn more about pattern mining, most of the general data mining books, such as the books of Han & Kamber and Tan, Steinbach & Kumar, have at least one chapter devoted to pattern mining.

If you want to test pattern mining algorithms, I recommend having a look at the SPMF data mining library (I’m the project founder), which offers the Java source code of more than 55 pattern mining algorithms, with simple examples, and a simple command line and graphical user interface for quickly testing the algorithms.

Conclusion

Pattern mining is a subfield of data mining that has been active for more than 20 years, and is still very active. Pattern mining algorithms have a wide range of applications. For example, the Apriori algorithm can also be applied to optimize the bitmap indexes of data warehouses. In this blog post, I have given two examples to give a rough idea of what the goal of pattern mining is. However, note that it is not possible to summarize twenty years of research in a single blog post!

Hope that this post was interesting! If you like this blog, you can subscribe to my RSS Feed or Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

The importance of sociability for researchers

Posted on 2013-10-07 by Philippe Fournier-Viger

There are several characteristics required to become a great researcher. Today, I will discuss one of them that is sometimes overlooked. It is sociability. Sociability means building and maintaining relationships with other researchers. The nature of these relationships can vary. It can be, for example, co-authoring a paper with another researcher, giving feedback on his project, or just having discussions with other researchers at conferences or at university.

Why is sociability important? After all, a researcher can become famous by publishing his or her ideas alone. This is true. However, building relationships with other researchers will bring several benefits to a researcher’s career.

Let’s consider the case of an M.Sc. or Ph.D. student. A student can work 100% of his time on his own project. In this case, he will publish a few articles. However, if a student works 80% of his time on his project and spends 20% of his time participating in the project of another student, he will publish more articles. The sociability of the student's research advisor is also important. If a student has a research advisor who has good relationships, some opportunities may arise, such as participating in books, workshop organization, etc.

Now, let’s consider the case of a researcher who has obtained his Ph.D. Relationships are also important, and probably more than for a student! If a researcher has good relationships, he may be invited to participate in some conference committees, to participate in grant proposals, to give talks at various universities, etc. More opportunities will be available to the researcher. For a professor, sociability may also help to find good students, and good students write good papers, which bring good grants, and so on.

A good piece of advice is therefore to try to build good relationships with other researchers. This can be done by attending academic conferences, by having discussions with colleagues, by sending e-mails to other researchers, etc.

Today, I was viewing my automatically generated profile on ArnetMiner (http://arnetminer.org/person/philippe-fournier-viger-351920.html), which offers eight characteristics to assess a researcher’s work: Activity, Citations, H-Index, G-Index, Sociability, Diversity, Papers and Rising Star. Here is my chart:

[Figure: characteristics of a researcher (ArnetMiner)]

H-Index, G-Index, Citations, Papers, Rising Star and Activity are measures mainly derived from the number of publications and the number of citations. Diversity is about the number of different topics that a researcher has worked on. Finally, Sociability is measured based on the number of co-authors, which gives some measure of sociability but excludes other non-measurable relationships between researchers.

In my case, the diversity measure is high because I have worked on several topics, ranging from intelligent agents and cognitive architectures to intelligent tutoring systems, before focusing more recently on data mining.

Hope that this post was interesting! I just wanted to share a few thoughts about this! If you have some other ideas, please share them in the comment section! If you like this blog, you can subscribe to my RSS Feed or Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

Update in 2022: here is how my ArnetMiner page looks in 2022:

Update in September, 2023:

How to encourage data mining researchers to share their source code and datasets?

Posted on 2013-09-20 by Philippe Fournier-Viger

A few months ago, I wrote a popular post on this blog about why it is important for researchers to publish source code and datasets. I explained several advantages that researchers can get by sharing the source code of their data mining algorithms, such as: (1) other researchers will save time because they don’t need to re-implement your algorithm(s), (2) other researchers are more likely to integrate your algorithms in their software and cite your research papers if you publish your source code and (3) people will compare with the original version of your algorithm rather than their own perhaps faulty or unoptimized implementation. I gave as an example my own open-source library for data mining named SPMF, which was cited in more than 50 papers from researchers all over the world. I have spent quite some time to start this project. But it has helped many researchers all over the world. So there are obviously some benefits to sharing source code and datasets. But still, why do so few data mining researchers share their source code and datasets? I think that we should attempt to change this. So the question that I want to ask today is: what should we do to encourage researchers to publish their source code and datasets?

There can be many different answers to this question. I will present some tentative answers and I hope that readers will also contribute by adding other ideas in the comment section.

First, I think that a good solution would be for the main data mining journals or conferences to add special tracks for papers whose authors publish their source code and datasets. For example, some popular conferences like KDD already have a general track, an industry track and perhaps some other tracks. If a special track was added for authors who submit their source code/datasets, such that they would have a slightly higher chance of being accepted, I think that it would be a great incentive.

Second, an idea is to organize special workshops or implementation competitions where researchers have to share their code/datasets. For example, in the field of frequent pattern mining, there were two famous implementation competitions, FIMI2003 and FIMI2004 (http://fimi.ua.ac.be/), where about 20 algorithm implementations were released. In this case, what was released was not source code for all algorithms, but at least the binaries were released. After 2004, no implementation workshop was held on this topic, and therefore very few authors have released implementations of the newer algorithms on this topic, which is a pity. If there were more workshops like this, it would encourage researchers to share their code and datasets.

Third, one could imagine creating some organized repositories or libraries so that researchers could share their source code and datasets. Some exist, but not many, and they are not very popular.

Fourth, one could think of creating incentives for students/researchers at universities who release their data and code, or even forcing them to release their code/data. For example, a department could request that all its students publish their code and data. Another alternative would be for funding agencies to request that code and data be shared.

Those are all the ideas that I have for now. If you have some other ideas, please share them in the comment section!

The importance of constraints in data mining

Posted on 2013-09-17 by Philippe Fournier-Viger

Today, I will discuss an important concept in data mining which is the use of constraints.

Data mining is a broad field incorporating many different kinds of techniques for discovering unexpected and new knowledge from data. Some main data mining tasks are: (1) clustering, (2) pattern mining, (3) classification and (4) outlier detection.

Each of these main data mining tasks offers a set of popular algorithms. Generally, the most popular algorithms are defined to handle a general and simple case that can be applied in many domains.

For example, consider the task of sequential pattern mining proposed by Agrawal and Srikant (1995). Without going into details, it consists of discovering subsequences that appear frequently in a set of sequences of symbols. In the original problem definition, a user only has two parameters: (1) a set of sequences and (2) a minimum frequency threshold indicating the minimal frequency that a pattern should have, to be found.
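To make the notion of frequency concrete, here is a minimal sketch of how the support of one candidate pattern could be counted. It is a simplification: each sequence is a plain list of symbols (the original problem definition also allows itemsets), and the class name SupportDemo, the toy database and the minsup value are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SupportDemo {

    // True if 'pattern' appears in 'sequence' in the same order,
    // not necessarily contiguously.
    static boolean isSubsequence(List<String> pattern, List<String> sequence) {
        int matched = 0;
        for (String symbol : sequence) {
            if (matched < pattern.size() && symbol.equals(pattern.get(matched))) {
                matched++;
            }
        }
        return matched == pattern.size();
    }

    // Number of sequences of the database that contain the pattern.
    static int support(List<String> pattern, List<List<String>> database) {
        int count = 0;
        for (List<String> sequence : database) {
            if (isSubsequence(pattern, sequence)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical toy database of four sequences.
        List<List<String>> database = Arrays.asList(
                Arrays.asList("a", "b", "c", "d"),
                Arrays.asList("a", "c", "d"),
                Arrays.asList("b", "a", "d"),
                Arrays.asList("c", "b"));
        List<String> pattern = Arrays.asList("a", "d");
        int minsup = 2; // minimum frequency threshold (assumed value)

        int sup = support(pattern, database);
        System.out.println(pattern + " support=" + sup + " frequent=" + (sup >= minsup));
    }
}
```

A real sequential pattern mining algorithm does not test candidates one by one like this. It explores the search space of patterns in a structured way, but the frequency test that it applies is the same idea.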

But applying a data mining algorithm in a real application often requires considering specific characteristics of the application. One way to do that is to add the concept of constraints. For example, in the past, I carried out a research project where I applied a sequential pattern mining algorithm to discover frequent sequences of actions performed by learners using an e-learning system (pdf). I first used a popular classical algorithm named PrefixSpan, but I quickly found that the patterns found were uninteresting. To filter uninteresting patterns, I modified the algorithm to add several constraints such as:
– the minimum/maximum length of a pattern in sequences where timestamps are used
– the minimum “gap” between two elements of a subsequence
– removing redundancy in results
– adding the notion of annotations and context to sequences
– …

By modifying the original algorithm to add constraints specific to the application domain, I got much better results (and for this work on e-learning, I received the best paper award at MICAI 2008). The lesson from this example is that it is often necessary to adapt existing algorithms by adding constraints or other domain specific ideas to get good results that are tailored to an application domain. In general, it is a good idea to start with a classical algorithm to see how it works and its limitations. Then, one can modify the algorithm or look for some existing modifications that are better suited for the application.

Lastly, another important point concerns data mining programmers. There are two ways to integrate constraints in data mining algorithms. First, it is possible to add constraints by performing post-processing on the result of a data mining algorithm. The advantage is that it is easy to implement. Second, it is possible to add constraints directly in the mining algorithms so as to use the constraints to prune the search space and improve the efficiency of the algorithms. This is more difficult to do, but it can provide much better performance in some cases. For example, in most frequent pattern mining algorithms, it is well-known that using constraints can greatly improve efficiency in terms of runtime and memory usage while greatly reducing the number of patterns found.
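The difference between the two approaches can be illustrated with a toy sketch. It is not a real mining algorithm: it simply enumerates all itemsets over a few items (no frequency counting) and applies a maximum length constraint, either after the enumeration or during it. The class ConstraintDemo and its methods are hypothetical names introduced for this illustration, and maxLength is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;

final class ConstraintDemo {

    // Number of candidate patterns visited by the last run.
    static int explored;

    // Depth-first enumeration of the itemsets built from items[start..].
    // A pattern that has reached maxLength is never extended (pruning).
    static void enumerate(String[] items, int start, List<String> prefix,
                          int maxLength, List<List<String>> out) {
        for (int i = start; i < items.length; i++) {
            prefix.add(items[i]);
            explored++;
            out.add(new ArrayList<String>(prefix));
            if (prefix.size() < maxLength) {
                enumerate(items, i + 1, prefix, maxLength, out);
            }
            prefix.remove(prefix.size() - 1);
        }
    }

    // Approach 1: enumerate everything, then filter the result.
    static List<List<String>> postProcess(String[] items, int maxLength) {
        explored = 0;
        List<List<String>> all = new ArrayList<List<String>>();
        enumerate(items, 0, new ArrayList<String>(), Integer.MAX_VALUE, all);
        List<List<String>> kept = new ArrayList<List<String>>();
        for (List<String> pattern : all) {
            if (pattern.size() <= maxLength) {
                kept.add(pattern);
            }
        }
        return kept;
    }

    // Approach 2: push the constraint into the search to prune it.
    static List<List<String>> pushed(String[] items, int maxLength) {
        explored = 0;
        List<List<String>> out = new ArrayList<List<String>>();
        enumerate(items, 0, new ArrayList<String>(), maxLength, out);
        return out;
    }

    public static void main(String[] args) {
        String[] items = {"a", "b", "c", "d"};
        List<List<String>> r1 = postProcess(items, 2);
        System.out.println("post-processing: " + r1.size() + " patterns, " + explored + " explored");
        List<List<String>> r2 = pushed(items, 2);
        System.out.println("pushed constraint: " + r2.size() + " patterns, " + explored + " explored");
    }
}
```

Both approaches return the same patterns, but the second one visits fewer candidates. With only four items the saving is small. The gap grows quickly with the number of items, which is why pushing constraints into the search can matter so much in practice.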

That is what I wanted to write for today. If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


Posted in Data Mining | Tagged algorithms, constraints, data mining | Leave a comment

How to measure the memory usage of data mining algorithms in Java?

Posted on 2013-08-02 by Philippe Fournier-Viger

Today, I will discuss the topic of accurately evaluating the memory usage of data mining algorithms in Java. I will share several problems that I have discovered with memory measurements in Java for data miners and strategies to avoid these problems and get accurate memory measurements.

In Java, there is an important challenge in making accurate memory measurements: the programmer cannot directly control when memory is freed. When a program no longer holds references to an object, there is no guarantee that the memory will be freed immediately. This is because in Java, the Garbage Collector (GC) is responsible for freeing the memory and it generally uses a lazy approach. In fact, during extensive CPU usage, I have often noticed that the GC waits until the maximum memory limit is reached before starting to free memory. Then, when the GC starts its work, it may considerably slow down your algorithm, thus causing inaccurate execution time measurements. For example, consider the following charts.

In these charts, I have compared the execution time (left) and memory usage (right) of two data mining algorithms named CMRules and CMDeo. When I performed the experiment, I noticed that as soon as CMDeo reached the 1 GB memory limit (red line), it suddenly became very slow because of garbage collection. This would create a large increase in execution time on the chart. Because this increase is not due to the algorithm itself but to the GC, I decided (1) not to include memory measurements for |S| > 100K for CMDeo in the final chart and (2) to mention in the research article that it was because of the GC that no measurement is given. This problem would not happen with a programming language like C++ because the programmer can decide when the memory is freed (there is no GC).

To avoid the aforementioned problem, the lesson that I have learned is to either (1) add more memory to your computer (or increase the memory allocated to your Java Virtual Machine) or (2) choose an experiment where the maximum memory limit will not be reached, to provide a fair comparison of the algorithms.

To increase the memory limit of the JVM (Java Virtual Machine), there is a command line parameter called -Xmx (the name is case-sensitive) that can work or not depending on your Java version. For example, if you want to launch a Jar file called spmf.jar with a maximum of 1024 megabytes of memory, you can do as follows.

java -Xmx1024m -jar spmf.jar

If you are running your algorithms from a development environment such as Eclipse, the -Xmx parameter can also be used:

Go to the menu Run > Run Configurations > then select the class that you want to run.
Go to the “Arguments” tab > Then paste the following text in the “VM Arguments” field: -Xmx1024m
Then press “Run”.
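To check that the setting has been taken into account, one simple option is to print the maximum amount of memory that the JVM reports. The sketch below uses the standard Runtime.maxMemory() method. The class name HeapLimit is only an assumption for illustration, and the reported value may differ slightly from the value passed to -Xmx.

```java
final class HeapLimit {

    // Maximum amount of memory that the JVM will attempt to use, in megabytes.
    static double maxHeapMB() {
        return Runtime.getRuntime().maxMemory() / 1024d / 1024d;
    }

    public static void main(String[] args) {
        System.out.println("max heap : " + maxHeapMB() + " MB");
    }
}
```

Running this class with and without the -Xmx parameter should show two different limits.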

Now that I have discussed the main challenges of memory measurement in Java, I will explain how to measure the memory usage accurately in Java. There are a few ways to do it and it is important to understand when they are best used.

Method 1. The first way is to measure the memory at two different times and to subtract the measurements. This can be done as follows:

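// Memory in use (in MB) before the code to be measured
double startMemory = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024d / 1024d;

// .....

// Memory in use (in MB) after the code to be measured
double endMemory = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024d / 1024d;

// The parentheses are required: without them, Java tries to subtract a double from a String
System.out.println(" memory :" + (endMemory - startMemory));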

This approach provides a very rough estimate of the memory usage. The reason is that it does not measure the real amount of memory used at a given moment, because objects that are no longer used may not yet have been freed by the GC. In some of my experiments, the amount of memory measured by this method even reached up to 10 times the amount of memory really used. However, when comparing algorithms, this method can still give a good idea of which algorithm has better memory usage. For this reason, I have used this method in a few research articles where the goal was to compare algorithms.
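A possible variation on Method 1 (an assumption of this sketch, not something prescribed above) is to read the memory usage at several checkpoints while the algorithm runs and to keep the largest value, instead of relying on only two readings. The class PeakMemory and its method names are hypothetical, and the loop in main() is just a stand-in workload.

```java
import java.util.ArrayList;
import java.util.List;

final class PeakMemory {

    // Largest memory usage (in MB) seen since the last reset.
    private static double peak = 0;

    // Memory currently in use by the JVM, in megabytes.
    static double usedMB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024d / 1024d;
    }

    static void reset() {
        peak = 0;
    }

    // Takes one reading, updates the peak if needed and returns the reading.
    static double checkpoint() {
        double now = usedMB();
        if (now > peak) {
            peak = now;
        }
        return now;
    }

    static double peakMB() {
        return peak;
    }

    public static void main(String[] args) {
        reset();
        checkpoint();
        // Stand-in workload: allocate 32 blocks of 1 MB and keep them reachable.
        List<byte[]> blocks = new ArrayList<byte[]>();
        for (int i = 0; i < 32; i++) {
            blocks.add(new byte[1024 * 1024]);
            checkpoint();
        }
        System.out.println("peak memory : " + peakMB() + " MB (" + blocks.size() + " blocks)");
    }
}
```

This remains a rough estimate for the same reason as before: each reading can include objects that the GC has not freed yet.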

Method 2. The second method is designed to calculate the memory used by a Java object. For data miners, it can be used to assess the size of a data structure, rather than observing the memory usage of an algorithm over a period of time. For example, consider the FPGrowth algorithm. It uses a large data structure named the FPTree. Measuring the size of an FPTree accurately is very difficult with the first method, for the reason mentioned previously. A solution is to use Method 2, which is to serialize the data structure that you want to measure as a stream of bytes and then to measure the size of the stream of bytes. This method gives a very close estimate of the real size of an object. This can be done as follows:

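// Requires: import java.io.ByteArrayOutputStream; import java.io.ObjectOutputStream;
// The data structure must implement java.io.Serializable
MyDataStructure myDataStructure = ....

// Serialize the data structure into an in-memory stream of bytes
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(baos);
oos.writeObject(myDataStructure);
oos.close();

// The size of the stream (converted to MB) is the estimate
System.out.println("size of data structure : " + baos.size() / 1024d / 1024d + " MB");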

With Method 2, I usually get accurate measurements. For example, recently I wanted to estimate the size of a new data structure that I have developed for data mining. When I was using Method 1, I got a value close to 500 MB after the construction of the data structure. When I used Method 2, I got a much more reasonable value of 30 MB. Note that this value can still be a little bit off because some additional information can be added by Java when an object is serialized.
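For readers who want to try Method 2 directly, here is a self-contained sketch wrapped in a reusable helper. The class SerializedSize and the method serializedSizeInBytes are hypothetical names, and an ArrayList of integers stands in for a real data structure. Keep in mind that the object and everything it references must implement java.io.Serializable, and that fields declared transient are not counted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class SerializedSize {

    // Serializes the object into memory and returns the number of bytes written.
    static long serializedSizeInBytes(Serializable object) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(object);
        } catch (IOException e) {
            // Not expected with an in-memory stream, unless the object
            // references something that is not serializable.
            throw new UncheckedIOException(e);
        }
        return baos.size();
    }

    public static void main(String[] args) {
        // Stand-in data structure: a list of 100,000 integers.
        ArrayList<Integer> data = new ArrayList<Integer>();
        for (int i = 0; i < 100000; i++) {
            data.add(i);
        }
        double sizeMB = serializedSizeInBytes(data) / 1024d / 1024d;
        System.out.println("size of data structure : " + sizeMB + " MB");
    }
}
```

The serialized form is not laid out like the object in memory, so the result should be read as an estimate rather than an exact figure.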

Method 3. There is an alternative to Method 2 that is reported to give a better estimate of the size of an object. It requires using the Java instrumentation framework. The downside of this approach is that it requires running an algorithm from the command line with a Jar file that needs to be created for this purpose, which is more complicated to do than the first two methods. This method can be used with Java >= 1.5. For more information on this method, see this tutorial.
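As a rough idea of what Method 3 involves, here is a minimal sketch of an agent class based on the standard java.lang.instrument API. The class name ObjectSizeAgent and the Jar name agent.jar are assumptions for illustration. The class must be packaged in a Jar whose manifest contains the line "Premain-Class: ObjectSizeAgent", and the program is then started with something like: java -javaagent:agent.jar -jar spmf.jar

```java
import java.lang.instrument.Instrumentation;

// In a real agent Jar, this class is conventionally declared public in its own file.
final class ObjectSizeAgent {

    // Set by the JVM through premain(); stays null if no agent was loaded.
    private static volatile Instrumentation instrumentation;

    // Called by the JVM before main() when the -javaagent option is used.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Approximate size in bytes of the object itself.
    // Objects that it references are NOT included.
    static long sizeOf(Object object) {
        if (instrumentation == null) {
            throw new IllegalStateException(
                    "Agent not loaded: start the JVM with the -javaagent option");
        }
        return instrumentation.getObjectSize(object);
    }
}
```

Since getObjectSize() only measures the object itself, estimating the size of a whole data structure such as a tree requires visiting every object that it references and summing the sizes.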

Other alternatives: There exist other alternatives, such as using a memory profiler to observe in more detail the behavior of a Java program in terms of memory usage. I will not discuss them in this blog post.

That is what I wanted to write for today. If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


Posted in Data Mining, Programming, Research | Tagged comparison, data mining, experiment, java, memory, performance | 1 Comment

How to make good looking charts for research papers?

Posted on 2013-07-29 by Philippe Fournier-Viger

Charts are often used in research papers to present experimental results. Today, I will discuss how to make good looking charts for presenting research results. I will not cover everything about this topic. But I will explain some key ideas.

If you are using Excel to make charts for your research papers, one of the most common mistakes is to use the default chart style. The default style is very colorful with thick lines. It is thus more appropriate for a PowerPoint presentation than a research paper. Charts appearing in research papers are most of the time printed in black and white and generally have to be small to save space. For example, below, I show a chart made with the default Excel style (left) and how I have tweaked its appearance to add it in one of my research papers.

The key modifications that I have made are:

Data line size = 0.75 pts (looks better when printed and the various lines can be seen more clearly)
Change the font size to 8 pts (enough for a research paper)
No fill color for markers
Marker line size = 0.75 pts
No border line for the whole chart
Remove the horizontal lines inside the chart area
Everything is black and white (looks better when printed) such as axis lines, markers, data lines, etc.

Besides, it is also important to:

Make sure that the units on each axis appear correctly.
If necessary, change the interval of minor and major units and the minimum and maximum values for each axis so that no space is wasted and that unit labels appear correctly.
Make sure that all axes have labels indicating the units (e.g. “Execution time (s)”).
Make sure that the chart has a legend.
If necessary, change the number format for each axis. For example, in the previous example, I have changed the number format of the X axis to “0 K” in the axis options of Excel, so that numbers such as 1,000,000 appear as 1000K instead. This saves a lot of space.

Do not convert charts to bitmaps. Another common mistake is to convert charts to image files before inserting them in a Word document. Unless you create a very high resolution image file, the printing quality will not be very good. A better solution is to directly copy the Excel chart into the Word document. If you do it this way, when printing or generating a PDF of your document, the chart will be treated as vector graphics rather than as a bitmap. This will greatly enhance the appearance of your chart when it is printed.

Alternatives to Excel: There are also several alternatives to Excel, such as R.

This is what I wanted to write for today. Obviously, more things could be said on this topic. But my goal was to highlight the importance of customizing the appearance of charts. In this post, I have shown an example. However, I recommend looking at charts from other research papers in your field to see what style is most appropriate for your field.

If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

Posted in General, Research | Tagged chart, excel, paper, Research, writing | 5 Comments
