The importance of sociability for researchers

There are several characteristics required to become a great researcher. Today, I will discuss one of them that is sometimes overlooked: sociability. Sociability means building and maintaining relationships with other researchers. The nature of these relationships can vary: it can be, for example, co-authoring a paper with another researcher, giving feedback on another researcher's project, or simply discussing with other researchers at conferences or at the university.

Why is sociability important? After all, a researcher can become famous by publishing his or her ideas alone. This is true. However, building relationships with other researchers will bring several benefits to a researcher's career.

Let's consider the case of an M.Sc. or Ph.D. student. A student can work 100% of his time on his own project. In this case, he will publish a few articles. However, if a student works 80% of his time on his own project and spends 20% of his time participating in the project of another student, he will publish more articles. The sociability of the student's research advisor is also important. If a student has a research advisor with good relationships, opportunities may arise, such as contributing to books, organizing workshops, etc.

Now, let's consider the case of a researcher who has obtained his Ph.D. Relationships are also important, and probably more so than for a student! If a researcher has good relationships, he may be invited to serve on conference committees, to participate in grant proposals, to give talks at various universities, etc. More opportunities will be available to the researcher. For a professor, sociability may also help to find good students, and good students write good papers, which bring good grants, and so on.

Good advice is therefore to try to build good relationships with other researchers.

Today, I was viewing my automatically generated profile on ArnetMiner (http://arnetminer.org/person/philippe-fournier-viger-351920.html), which offers eight characteristics to assess a researcher's work: Activity, Citations, H-Index, G-Index, Sociability, Diversity, Papers and Rising Star. Here is my chart:

characteristics of a researcher (arnetminer)

H-Index, G-Index, Citations, Papers, Rising Star and Activity are measures mainly derived from the number of publications and the number of citations. Diversity is about the number of different topics that a researcher has worked on. Finally, sociability is measured based on the number of co-authors, which gives an indication of sociability but excludes other relationships between researchers that cannot be measured this way.

In my case, the diversity measure is high because I have worked on several topics, ranging from intelligent agents and cognitive architectures to intelligent tutoring systems, before focusing more recently on data mining.

Update 2017/03: Almost four years have passed since I wrote this blog post. Here is a brief update that shows how the profile of a researcher can evolve over time. Below is a screenshot of my statistics generated by ArnetMiner, and my citation count according to Google Scholar:
updated_metrics

As can be seen above, my scores for each dimension calculated by ArnetMiner have improved considerably in recent years, as has the number of citations.

Hope that this post was interesting! I just wanted to share a few thoughts about this. If you have some other ideas, please share them in the comment section! If you like this blog, you can subscribe to my RSS Feed or Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.

 

How to encourage data mining researchers to share their source code and datasets?

A few months ago, I wrote a popular blog post on this blog about why it is important for researchers to publish source code and datasets. I explained several advantages that researchers can get from sharing the source code of their data mining algorithms: (1) other researchers save time because they do not need to re-implement your algorithm(s), (2) other researchers are more likely to integrate your algorithms into their software and cite your research papers if you publish your source code, and (3) people will compare against the original version of your algorithm rather than their own, perhaps faulty or unoptimized, implementation. As an example, I gave my own open-source library for data mining, named SPMF, which has been cited in more than 50 papers by researchers from all over the world. I have spent quite some time on this project, but it has helped many researchers around the world. So there are obviously benefits to sharing source code and datasets. Still, why do so few data mining researchers share their source code and datasets? I think that we should attempt to change this. So the question that I want to ask today is: what should we do to encourage researchers to publish their source code and datasets?

code

There can be many different answers to this question. I will present some tentative answers and I hope that readers will also contribute by adding other ideas in the comment section.

First, I think that a good solution would be for the main data mining journals or conferences to add special tracks for papers whose authors publish their source code and datasets. For example, some popular conferences like KDD already have a general track, an industry track and perhaps some other tracks. If a special track were added for authors who submit their source code/datasets, such that these papers would have a slightly higher chance of being accepted, I think that it would be a great incentive.

Second, an idea is to organize special workshops or implementation competitions where researchers have to share their code/datasets. For example, in the field of frequent pattern mining, there were two famous implementation competitions, FIMI 2003 and FIMI 2004 (http://fimi.ua.ac.be/), where about 20 algorithm implementations were released. In this case, what was released was not the source code of all algorithms, but at least the binaries were released. After 2004, no implementation workshop was held on this topic, and therefore very few authors have released implementations of the newer algorithms, which is a pity. If there were more workshops like this, it would encourage researchers to share their code and datasets.

Third, one could imagine creating organized repositories or libraries where researchers could share their source code and datasets. A few such repositories exist, but not many, and they are not very popular.

Fourth, one could think of creating incentives for students/researchers at universities who release their data and code, or even requiring them to release their code/data. For example, a department could request that all its students publish their code and data. Another alternative would be for funding agencies to require that code and data be shared.

That is all the ideas that I have for now.  If you have some other ideas, please share them in the comment section!


P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.

If you like this blog, you can subscribe to the RSS Feed or Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

The importance of constraints in data mining

Today, I will discuss an important concept in data mining which is the use of constraints.

constraints

Data mining is a broad field incorporating many different kinds of techniques for discovering unexpected and new knowledge from data. Some of the main data mining tasks are: (1) clustering, (2) pattern mining, (3) classification and (4) outlier detection.

Each of these main data mining tasks offers a set of popular algorithms. Generally, the most popular algorithms are defined to handle a general and simple case that can be applied in many domains.

For example, consider the task of sequential pattern mining proposed by Agrawal and Srikant (1995). Without going into details, it consists of discovering subsequences that appear frequently in a set of sequences of symbols. In the original problem definition, the user has only two parameters: (1) a set of sequences and (2) a minimum frequency threshold indicating the minimal frequency that a pattern should have to be found.

However, applying a data mining algorithm in a real application often requires considering the specific characteristics of the application. One way to do that is to add constraints. For example, in the past, I did a research project where I applied a sequential pattern mining algorithm to discover frequent sequences of actions performed by learners using an e-learning system (pdf). I first used a popular classical algorithm named PrefixSpan, but I quickly found that the patterns found were uninteresting. To filter out uninteresting patterns, I modified the algorithm to add several constraints such as:
– the minimum/maximum length of a pattern in sequences where timestamps are used
– the minimum “gap” between two elements of a subsequence
– removing redundancy in results
– adding the notion of annotations and context to sequences
– …

By modifying the original algorithm to add constraints specific to the application domain, I got much better results (and for this work on e-learning, I received the best paper award at MICAI 2008). The lesson from this example is that it is often necessary to adapt existing algorithms by adding constraints or other domain-specific ideas to get good results that are tailored to an application domain. In general, it is a good idea to start with a classical algorithm to see how it works and what its limitations are. Then, one can modify the algorithm or look for existing modifications that are better suited to the application.

Lastly, another important point concerns data mining programmers. There are two ways to integrate constraints in data mining algorithms. First, it is possible to apply constraints by post-processing the result of a data mining algorithm. The advantage is that it is easy to implement. Second, it is possible to add constraints directly in the mining algorithm so as to use the constraints to prune the search space and improve the efficiency of the algorithm. This is more difficult to do, but it can provide much better performance in some cases. For example, in most frequent pattern mining algorithms, it is well known that using constraints can greatly increase the efficiency in terms of runtime and memory usage while greatly reducing the number of patterns found.
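
To make the post-processing approach concrete, here is a minimal Java sketch (the Pattern class and its fields are hypothetical names for this example, not taken from SPMF) that filters a list of already discovered patterns to keep only those respecting a minimum and maximum length constraint:

import java.util.ArrayList;
import java.util.List;

public class LengthConstraintFilter {

    // A hypothetical pattern: a sequence of item identifiers and its support count.
    static class Pattern {
        List<Integer> items;
        int support;
        Pattern(List<Integer> items, int support) {
            this.items = items;
            this.support = support;
        }
    }

    // Post-processing: keep only patterns whose length is within [minLength, maxLength].
    static List<Pattern> filterByLength(List<Pattern> patterns, int minLength, int maxLength) {
        List<Pattern> result = new ArrayList<>();
        for (Pattern p : patterns) {
            int length = p.items.size();
            if (length >= minLength && length <= maxLength) {
                result.add(p);
            }
        }
        return result;
    }
}

The same length constraint could instead be pushed inside the mining algorithm, for example by not extending a pattern once it reaches the maximum length, which also prunes the search space.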

That is what I wanted to write for today. If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.

How to measure the memory usage of data mining algorithms in Java?

Today, I will discuss the topic of accurately evaluating the memory usage of data mining algorithms in Java. I will share several problems that I have discovered with memory measurements in Java for data miners and strategies to avoid these problems and get accurate memory measurements.

In Java, there is an important challenge for making accurate memory measurements: the programmer does not have the possibility to control memory allocation. In Java, when a program no longer holds references to an object, there is no guarantee that the memory will be freed immediately. This is because in Java the Garbage Collector (GC) is responsible for freeing the memory, and it generally uses a lazy approach. In fact, during extensive CPU usage, I have often noticed that the GC waits until the maximum memory limit is reached before starting to free memory. Then, when the GC starts its work, it may considerably slow down your algorithm, thus causing inaccurate execution time measurements. For example, consider the following charts.

measuring memory of data mining algorithms in Java

In these charts, I have compared the execution time (left) and memory usage (right) of two data mining algorithms named CMRules and CMDeo. When I performed the experiment, I noticed that as soon as CMDeo reached the 1 GB memory limit (red line), it suddenly became very slow because of garbage collection. This created a large increase in execution time on the chart. Because this increase is not due to the algorithm itself but to the GC, I decided to (1) not include measurements for |S| > 100K for CMDeo in the final chart and (2) mention in the research article that no measurement is given because of the GC. This problem would not happen with a programming language like C++, because the programmer can decide when the memory is freed (there is no GC).

To avoid the aforementioned problem, the lesson that I have learned is to either (1) add more memory to your computer (or increase the memory allocated to your Java Virtual Machine) or (2) choose an experiment where the maximum memory limit will not be reached, to provide a fair comparison of the algorithms.

To increase the memory limit of the JVM (Java Virtual Machine), there is a command line parameter called -Xmx, which may or may not work depending on your Java version. For example, if you want to launch a Jar file called spmf.jar with 1024 megabytes of RAM, you can do as follows.

java -Xmx1024m -jar spmf.jar

If you are running your algorithms from a development environment such as Eclipse, the -Xmx parameter can also be used:

  • Go to the menu Run > Run Configurations, then select the class that you want to run.
  • Go to the “Arguments” tab, then paste the following text in the “VM Arguments” field:  -Xmx1024m
  • Then press “Run”.

Now that I have discussed the main challenges of memory measurement in Java, I will explain how to measure memory usage accurately in Java. There are a few ways to do it, and it is important to understand when each is best used.

Method 1. The first way is to measure the memory at  two different times and to subtract the measurements. This can be done as follows:

double startMemory = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory())
        / 1024d / 1024d;

// ... run the algorithm to be measured ...

double endMemory = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory())
        / 1024d / 1024d;

System.out.println(" memory : " + (endMemory - startMemory) + " MB");

This approach provides only a very rough estimate of the memory usage. The reason is that it does not measure the real amount of memory used at a given moment, because of the GC. In some of my experiments, the amount of memory measured by this method reached up to 10 times the amount of memory really used. However, when comparing algorithms, this method can still give a good idea of which algorithm has better memory usage. For this reason, I have used this method in a few research articles where the goal was to compare algorithms.
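
A small refinement that can reduce the noise of Method 1 is to suggest a garbage collection before each measurement. The helper below is only a sketch (the class name is hypothetical); note that System.gc() is merely a hint to the JVM, so the estimate remains rough:

public class MemorySampler {

    // Returns an estimate of the currently used memory, in megabytes.
    // System.gc() is only a request: the JVM may ignore it, so this stays a rough estimate.
    public static double currentMemoryMB() {
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        return (runtime.totalMemory() - runtime.freeMemory()) / 1024d / 1024d;
    }
}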

Method 2. The second method is designed to calculate the memory used by a Java object. For data miners, it can be used to assess the size of a data structure, rather than to observe the memory usage of an algorithm over a period of time. For example, consider the FPGrowth algorithm. It uses a large data structure named the FPTree. Measuring the size of an FPTree accurately is very difficult with the first method, for the reason mentioned previously. A solution is to use Method 2, which is to serialize the data structure that you want to measure into a stream of bytes and then to measure the size of the stream of bytes. This method gives a very close estimate of the real size of an object. This can be done as follows:

// Note: MyDataStructure must implement java.io.Serializable.
MyDataStructure myDataStructure = ....

ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(baos);
oos.writeObject(myDataStructure);
oos.close();

System.out.println("size of data structure : " + baos.size() / 1024d / 1024d + " MB");

With Method 2, I usually get accurate measurements. For example, recently I wanted to estimate the size of a new data structure that I have developed for data mining. When I used Method 1, I got a value close to 500 MB after the construction of the data structure. When I used Method 2, I got a much more reasonable value of 30 MB. Note that this value can still be a little bit off, because some additional information can be added by Java when an object is serialized.

Method 3. There is an alternative to Method 2 that is reported to give a better estimate of the size of an object. It requires using the Java instrumentation framework. The downside of this approach is that it requires running the algorithm from the command line with a Jar file created for this purpose, which is more complicated than the first two methods. This method can be used with Java >= 1.5. For more information on this method, see this tutorial.
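
For readers who are curious, here is a minimal sketch of what a Java agent based on the instrumentation framework could look like. The class name is hypothetical; the Jar containing it must declare it as the Premain-Class in its manifest, and the program must be launched with the -javaagent option:

import java.lang.instrument.Instrumentation;

// Hypothetical agent class. Package it in a Jar whose manifest contains
// "Premain-Class: ObjectSizeAgent", then run: java -javaagent:agent.jar -jar spmf.jar
public class ObjectSizeAgent {

    private static volatile Instrumentation instrumentation;

    // Called by the JVM before main() when the agent is attached.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the size (in bytes) of the given object, as reported by the JVM.
    public static long sizeOf(Object object) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent not loaded; run with -javaagent");
        }
        return instrumentation.getObjectSize(object);
    }
}

Note that getObjectSize() returns the size of a single object, not of the objects it references, so measuring a whole data structure this way requires traversing the object graph and summing the sizes.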

Other alternatives: There exist other alternatives, such as using a memory profiler to observe in more detail the behavior of a Java program in terms of memory usage. I will not discuss them in this blog post.

That is what I wanted to write for today.   If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!


P. Fournier-Viger is the founder of the Java open-source  data mining software SPMF, offering more than 50 data mining algorithms.

How to make good looking charts for research papers?

Charts are often used in research papers to present experimental results. Today, I will discuss how to make good looking charts for presenting research results. I will not cover everything about this topic. But I will explain some key ideas.

If you are using Excel to make charts for your research papers, one of the most common mistakes is to use the default chart style. The default style is very colorful, with thick lines. It is thus more appropriate for a PowerPoint presentation than for a research paper. Charts appearing in research papers are most of the time printed in black and white and generally have to be small to save space. For example, below I show a chart made with the default Excel style (left) and how I tweaked its appearance to include it in one of my research papers.

how to make chart for research paper

The key modifications that I have made are:

  • Data line size = 0.75 pt (looks better when printed, and the various lines can be seen more clearly)
  • Change the font size to 8 pt (enough for a research paper)
  • No fill color for markers
  • Marker line size = 0.75 pt
  • No border line around the whole chart
  • Remove the horizontal lines inside the chart area
  • Everything in black and white (looks better when printed), including axis lines, markers, data lines, etc.

Besides, it is also important to:

  • Make sure that the units on each axis appear correctly.
  • If necessary, change the interval of minor and major units and the minimum and maximum values for each axis so that no space is wasted and that unit labels appear correctly.
  • Make sure that all axes have labels indicating the units (e.g. “Execution time (s)”).
  • Make sure that the chart has a legend.
  • If necessary, change the number format for each axis. For example, in the previous example, I changed the number format of the X axis to “0 K” in the axis options of Excel, so that numbers such as 1,000,000 appear as 1000K instead. This saves a lot of space.

Do not convert charts to bitmaps. Another common mistake is to convert charts to image files before inserting them in a Word document. Unless you create a very high-resolution image file, the printing quality will not be very good. A better solution is to copy the Excel chart directly into the Word document. If you do this, when printing or generating a PDF of your document, the chart will be treated as vector graphics rather than as a bitmap. This will greatly enhance the appearance of your chart when it is printed.

Alternatives to Excel: A reader (Antonio) has sent me a great tutorial about using R for making charts, as an alternative to Excel. I think that it looks like a great alternative that could also be used with LaTeX. Antonio also sent a link about how to use R in Excel, for those interested.

This is what I wanted to write for today. Obviously, more things could be said on this topic, but my goal was to highlight the importance of customizing the appearance of charts. In this post, I have shown one example. However, I recommend looking at charts from other research papers in your field to see which style is the most appropriate for your field.

If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

How to search for a research advisor by e-mail?

In this blog post, I will discuss how to search for a research advisor by e-mail.

email

Today, I received an e-mail from a Ph.D. student from abroad asking to work with me as a post-doc on the topic of “Web Services”. Let's have a look at the e-mail, and then I will discuss the problems with it.

From:  XXXXXXXX@researchabroad.net

Dear Professor Fournier-Viger,

My name is XXXXXX. I am interested in many areas, including but not limited to “XXXXX”. I am very interested in applying for a postdoctoral position in your lab.

I completed my Ph.D XXXXXXXX majored in XXXX, from XXXXX University, in XXXXXX. Before that, I focused on XXXXXXX both in Master and Bachelor studies.

My research goal is to provide a novel service model XXXXXXXXXXX and so on.

I have many years’ experience in service computing research. And the areas I can pursue is as following,
Mulit-agent research;
Services computing research;
XXXXXXXXX
XXXXXXXXXXXx
XXXXXXXXXX
XXXXXXXXXXXX

I would be grateful if you would give me the opportunity to work in your group. The attached is my CV for your review.
I am eagerly looking forward to hearing from you.

Sincerely yours,
XXXXXXXXXX

When I read this e-mail, I see right away that this message was probably sent to hundreds or thousands of professors. The reason why I get this impression is that I am not working on “web services” and that the student writes about HIS research interests instead of explaining why he is interested in working with me. When I receive this kind of e-mail, I usually delete it, and I know that several other professors at other universities do the same. On the other hand, if I receive a personalized message from a student that explains why he wants to work with me, I will take the time to read it carefully and answer it.

The advice that I want to give in this post is that to be successful when searching for a research advisor by e-mail, it is important to write a personalized e-mail to each professor and to choose professors related to your field. It takes more time, but it will be more successful. This is what I did when looking for a post-doc position as a Ph.D. student, and it worked very well.

This is what I wanted to write for today.  If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

What are the steps to implement a data mining algorithm?

In this post, I will discuss the steps that I follow to implement a data mining algorithm.

steps to implement data mining algorithms

The subject of this post comes from a question that I received by e-mail recently, and I think that it is interesting to share the answer. The question was:

When we do the programming I’ve observed a good programmer always makes a flow chart, data flow diagram, context diagram etc. to make the programming error free.  (…) In your implementation of XXXX  algorithm, have you performed any procedures as discussed above? (…) Can you explain me what are the basic procedures that you have applied to implement XXXX algorithm?

When I implement a data mining algorithm, I don't do any flowchart or UML design, because I consider a single data mining algorithm to be a small project. If I were programming a larger software program, I would think about object-oriented design and UML. But a single algorithm does not need an object-oriented design. Actually, what matters most for a data mining algorithm is that it produces the correct result, is fast, and preferably that the code is well documented and clean.

Step 1: understanding the algorithm. To implement a data mining algorithm, the first step that I perform is to read the research paper describing it and make sure that I understand it well. Usually, I need to read the paper a few times to understand it. If there is an example in the paper, I try to understand it, and later I use this example to test the algorithm to see if my implementation is correct. After I read the paper a few times, I may still not understand some details about the algorithm. But this is OK. There are often some tricky details that you may only understand when doing the implementation, because the authors do not always give all the details in the paper, sometimes due to lack of space. In particular, it is common that authors do not describe all the optimizations and data structures that they have used.

Step 2: implementing a first draft of the algorithm step by step. Then, I start to implement the algorithm. To implement the algorithm, I print the pseudocode and try to implement it. For an algorithm that takes a file as input, I will first work on the code for reading the input file. I will test this code extensively to make sure that I read the input file correctly. Then, I will add additional code to perform the first operations of the algorithm. I will check if the intermediate result is correct by comparing it with the example in the paper. If it is not correct, I will debug and maybe read the paper again to make sure that I did not misunderstand something in the paper. Then I will continue until the algorithm is fully implemented.
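
As an illustration of this step, here is a minimal sketch of code for reading an input file, assuming a simple hypothetical format where each line is a transaction of space-separated integer item identifiers (the format and class name are assumptions for this example, not a specification of any particular algorithm's input):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TransactionDatabaseReader {

    // Reads a transaction database where each line is a transaction of
    // space-separated integer item identifiers, e.g. "1 3 4".
    public static List<List<Integer>> read(String path) throws IOException {
        List<List<Integer>> database = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // skip blank lines and comment lines
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                List<Integer> transaction = new ArrayList<>();
                for (String token : line.split("\\s+")) {
                    transaction.add(Integer.parseInt(token));
                }
                database.add(transaction);
            }
        }
        return database;
    }
}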

Step 3: testing with other input files. When my implementation appears correct, I try it with a few more examples, to make sure that it is not only correct for a single example but also provides the correct result for other input files.

Step 4: cleaning the code. After that, I will clean my code because the first draft is likely not pretty.

Step 5: optimizing the code. Then I try to optimize the algorithm by (1) using better data structures and (2) simplifying the code by removing unnecessary operations, etc. For example, for my implementation of PrefixSpan in my SPMF data mining software, I first made a very simple implementation without an optimization called pseudo-projection that is described in the paper. It was very slow. After my implementation was correct, I took a few days to optimize it. For example, I added the pseudo-projection, I added code for another optimization, which is to remove infrequent items after the first scan of the input file, I removed some unnecessary code that I had left, I reorganized some code, I added some comments, etc.
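
To give an idea of the "remove infrequent items after the first scan" optimization, here is a minimal sketch for the simpler case of a transaction database (the class and method names are hypothetical, not taken from SPMF): a first pass counts the support of each item, and a second pass removes the items that do not reach the minimum support count:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InfrequentItemPruner {

    // Removes from each transaction the items whose support is below minSupportCount.
    // Assumes that each item appears at most once per transaction.
    public static List<List<Integer>> prune(List<List<Integer>> database, int minSupportCount) {
        // First scan: count in how many transactions each item appears.
        Map<Integer, Integer> supportCounts = new HashMap<>();
        for (List<Integer> transaction : database) {
            for (Integer item : transaction) {
                supportCounts.merge(item, 1, Integer::sum);
            }
        }
        // Second scan: keep only the frequent items in each transaction.
        List<List<Integer>> prunedDatabase = new ArrayList<>();
        for (List<Integer> transaction : database) {
            List<Integer> pruned = new ArrayList<>();
            for (Integer item : transaction) {
                if (supportCounts.get(item) >= minSupportCount) {
                    pruned.add(item);
                }
            }
            prunedDatabase.add(pruned);
        }
        return prunedDatabase;
    }
}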

Step 6: comparing the performance with other implementations of the same algorithm / peer review. After my code is optimized, as an optional sixth step, I may compare the performance of my implementation with other implementations of the same algorithm, if some are available on the Internet in the same programming language. If my implementation is slower, I may look at the source code of the other implementation to see if there are some ideas that I have not thought of that could be used to further optimize my code. I may also ask some of my friends or colleagues to review my code. Another good approach is not to show the code to your colleagues but just to explain the main idea to them to get their feedback. Discussing with other people is a good way to learn.

It takes time… Note that becoming good at programming data mining algorithms takes time. For example, let me tell you my story. The first time I implemented data mining algorithms was in December 2007. I implemented the Apriori algorithm for a course project at university. My implementation was terrible and slow… but it generated the correct result. I then implemented PrefixSpan in 2008. At that time, my code was better because I was gaining some experience in implementing this kind of algorithm. Then in 2010, I read my Apriori and PrefixSpan code again, still found some problems, and optimized them again. What I want to say here is that it is normal that the first implementation(s) of data mining algorithms that a person makes may not be very good. But after implementing a few algorithms, it becomes much easier. Now, we are in 2013 and I have implemented more than 45 algorithms in my open-source SPMF Java data mining software!

This is what I wanted to write for today. Hope that you enjoyed this post. If you want to read more on this topic, you may be interested in my post on How to be a good data mining programmer. Lastly, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

Choosing data structures according to what you want to do

Today, I am writing a post about programming. I want to share a simple but important idea for writing optimized code: choose data structures according to what you want to do with them instead of what you want to store. This idea is simple, but I am writing this post because it addresses a common beginner's misconception, which is to think of data structures solely in terms of what they can store.

For example, a beginner programmer may think that s/he should use an array or a list because s/he wants to store some items in a given order, or simply because s/he wants to store a set of single values. To store two-dimensional data, a simple idea is to use a two-dimensional array, etc. This is a simple reasoning that is fine for implementing a program that works.

However, to write an optimized program, it is important to think further about how the data will be used. For example, consider that you need to write a program where you have to store a long list of integer values that is updated periodically (add and remove) and where you want to quickly find the minimum and maximum values. If a programmer thinks about what s/he needs to store, s/he may decide to use an array. If the programmer thinks in terms of what s/he wants to do with the data, s/he may decide to use a list (an array that is dynamically resized) because add and remove operations will be performed periodically. This could be a better solution. However, if the programmer thinks further in terms of what s/he wants to do with the data, s/he may decide to use a red-black tree, which guarantees an O(log(n)) worst-case time cost for the four operations add, remove, minimum and maximum. This could be a much better solution!
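
For instance, in Java, a TreeSet is backed by a red-black tree and supports add, remove, minimum (first) and maximum (last) in O(log n) time. Here is a minimal sketch; note that a TreeSet does not keep duplicate values, so if duplicates must be kept, a TreeMap from value to count could be used instead:

import java.util.TreeSet;

public class MinMaxExample {
    public static void main(String[] args) {
        // TreeSet is backed by a red-black tree: add, remove, first (minimum)
        // and last (maximum) all run in O(log n) time.
        TreeSet<Integer> values = new TreeSet<>();
        values.add(42);
        values.add(7);
        values.add(19);

        System.out.println("min = " + values.first()); // 7
        System.out.println("max = " + values.last());  // 42

        values.remove(7);
        System.out.println("new min = " + values.first()); // 19
    }
}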

It is therefore important to take the time to find appropriate data structures if one wants to write optimized code. Also note that, while execution time is important, memory usage is also sometimes very important.

To show an example of the impact of choosing appropriate data structures on performance, I here compare three versions of TopKRules, an algorithm for mining the top-k association rules in a transaction database. TopKRules needs to store a list of candidates and a list of the k best rules, and to perform add, remove, minimum and maximum operations. Furthermore, it needs to be able to quickly perform the intersection of two sets of integers. The next chart shows a performance comparison, in terms of execution time, of three versions of TopKRules when the parameter k increases and the problem becomes more difficult, for a dataset called mushrooms.

  • Version A is TopKRules implemented with lists.
  • Version B is TopKRules implemented with bitsets to quickly perform the intersection by doing the logical AND operation.
  • Version C is TopKRules implemented with bitsets plus red-black trees for storing the candidates and the k best rules, to quickly perform the add, remove, minimum and maximum operations.

Optimization of TopKRules

As you can see from this chart, there is quite a large improvement in performance by using appropriate data structures!
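
To give an idea of the bitset optimization used in versions B and C, here is a minimal sketch (the variable names are illustrative, not taken from the TopKRules source code) showing how the intersection of two sets of transaction identifiers can be computed with java.util.BitSet using a logical AND:

import java.util.BitSet;

public class BitsetIntersectionExample {
    public static void main(String[] args) {
        // Each bit i set to true means "the rule appears in transaction i".
        BitSet tidsOfRuleX = new BitSet();
        tidsOfRuleX.set(1);
        tidsOfRuleX.set(4);
        tidsOfRuleX.set(7);

        BitSet tidsOfRuleY = new BitSet();
        tidsOfRuleY.set(4);
        tidsOfRuleY.set(7);
        tidsOfRuleY.set(9);

        // Intersection by logical AND (performed word by word, so it is very fast).
        BitSet intersection = (BitSet) tidsOfRuleX.clone();
        intersection.and(tidsOfRuleY);

        System.out.println("common transactions: " + intersection);                       // {4, 7}
        System.out.println("support of the intersection: " + intersection.cardinality()); // 2
    }
}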

That’s all I wanted to write for today. Hope that you enjoyed this post. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

Analyzing the source code of the SPMF data mining software

Hi everyone,

In this blog post, I will discuss how I applied an open-source tool named Code Analyzer (http://sourceforge.net/projects/codeanalyze-gpl/) to analyze the source code of my open-source data mining software SPMF.

I applied the tool to the previous version (0.92c) of SPMF, and it printed the following result:

Metric                              Value
---------------------------------   -----
Total Files                           360
Total Lines                         50457
Avg Line Length                        30
Code Lines                          31901
Comment Lines                       13297
Whitespace Lines                     6583
Code/(Comment+Whitespace) Ratio      1.60
Code/Comment Ratio                   2.40
Code/Whitespace Ratio                4.85
Code/Total Lines Ratio               0.63
Code Lines Per File                    88
Comment Lines Per File                 36
Whitespace Lines Per File              18

Now, what is interesting is the difference when I apply the same tool on the latest version of SPMF (0.93). It gives the following result:

Metric                              Value
---------------------------------   -----
Total Files                           280
Total Lines                         53165
Avg Line Length                        32
Code Lines                          25455
Comment Lines                       23208
Whitespace Lines                     5803
Code/(Comment+Whitespace) Ratio      0.88
Code/Comment Ratio                   1.10
Code/Whitespace Ratio                4.39
Code/Total Lines Ratio               0.48
Code Lines Per File                    90
Comment Lines Per File                 82
Whitespace Lines Per File              20

As you can see from these statistics, I have done a lot of refactoring for the latest version. There are now 280 files instead of 360 files. Moreover, I have shrunk the code from 31901 lines to 25455 lines, without removing any functionality!

Also, I have added a lot of comments to SPMF. The “Code/Comment” ratio has thus changed from 2.40 to 1.10, and the “Comment Lines Per File” value went up from 36 to 82 lines. In total, there are now around 10,000 more lines of comments than in the previous version (the number of lines of comments has increased from 13297 to 23208).

That’s all I wanted to write for today!  If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

 

How to auto-adjust the minimum support threshold according to the data size

Today, I will do a quick post on how to automatically adjust the minimum support threshold of frequent pattern mining algorithms such as Apriori, FPGrowth and PrefixSpan according to the size of the data.

adjust minsup threshold

The problem is simple. Let's consider the Apriori algorithm. The input of this algorithm is a transaction database and a parameter called minsup, which is a value in [0,1]. The output of Apriori is a set of frequent patterns. A frequent pattern is a pattern whose support is higher than or equal to minsup. The support of a pattern (also called its “frequency”) is the number of transactions that contain the pattern divided by the total number of transactions in the database. A key problem for algorithms like Apriori is how to choose a minsup value to find interesting patterns. There is no really easy way to determine the best minsup threshold; usually, it is done by trial and error. Now let's say that you have determined the best minsup value for your application.

Now, consider that the size of your data can change over time. In this case, how can you dynamically adjust the minimum support so that the threshold is higher when you don't have a lot of data and becomes lower when you have more data?

The solution

A simple solution that I have found is to use a mathematical function to adjust the minsup threshold automatically according to the database size (the number of transactions, for Apriori). This solution is shown below.

I chose to use the formula minsup(x) = e^(-ax-b) + c, where x is the number of transactions in the database and a, b, c are positive constants. This allows setting minsup to a high value when there is not a lot of data and then decreasing minsup when there is more data. For example, on the first chart below, the minsup value is 0.7 if there is 1 transaction, it becomes 0.4 when there are 3 transactions, and then decreases to 0.2 when there are around 9 transactions. This makes the minsup threshold stricter when there is less data. Note that the constants a, b and c can be adjusted to make the curve behave differently.

minsup threshold

On the second chart above, I show the relative minsup threshold. It is simply the minsup threshold multiplied by the database size. It shows the number of transactions in which a pattern needs to appear to be considered frequent, according to the database size.

What is the meaning of the constants a, b, and c?

The constant c is the smallest value that this function will produce. For example, if c = 0.2, the function will not generate minsup values that are less than 0.2. The constants a and b influence how quickly the curve decreases when x increases.
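
Here is a minimal sketch of this formula in Java. The constants a ≈ 0.46, b ≈ 0.23 and c = 0.2 are only illustrative values, chosen so that the output roughly matches the example given above (about 0.7 for one transaction, 0.4 for three, and close to 0.2 for nine or more):

public class AdaptiveMinsup {

    // minsup(x) = e^(-a*x - b) + c, where x is the number of transactions.
    public static double minsup(long databaseSize, double a, double b, double c) {
        return Math.exp(-a * databaseSize - b) + c;
    }

    public static void main(String[] args) {
        double a = 0.46, b = 0.23, c = 0.2; // illustrative constants only
        for (long x : new long[] {1, 3, 9, 100}) {
            System.out.printf("minsup(%d) = %.3f%n", x, minsup(x, a, b, c));
        }
    }
}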

What is this function called?

In mathematics, this type of function is called an exponential decay function, as it decreases exponentially when x increases. The idea of using this function for pattern mining was first proposed in my Ph.D. thesis:

Fournier-Viger, P. (2010), Un modèle hybride pour le support à l’apprentissage dans les domaines procéduraux et mal-définis. Ph.D. Thesis, University of Quebec in Montreal, Montreal, Canada, 184 pages.

Conclusion

Hope this little trick is helpful!

Next time, I will try to talk about a more general data mining topic. I promised that I would do that last time, but today I wanted to talk about this topic, so I will do that next time instead!

Philippe
Founder of the SPMF data mining software