How to encourage data mining researchers to share their source code and datasets?

A few months ago, I wrote a popular post on this blog about why it is important for researchers to publish their source code and datasets. I explained several advantages that researchers gain by sharing the source code of their data mining algorithms: (1) other researchers save time because they don't need to re-implement your algorithm(s), (2) other researchers are more likely to integrate your algorithms into their software and cite your research papers if you publish your source code, and (3) people will compare against the original version of your algorithm rather than their own, perhaps faulty or unoptimized, implementation. As an example, I gave my own open-source data mining library, SPMF, which has been cited in more than 50 papers by researchers all over the world. I spent quite some time starting this project, but it has helped many researchers. So there are obviously some benefits to sharing source code and datasets. Still, why do so few data mining researchers share theirs? I think that we should try to change this. So the question that I want to ask today is: what should we do to encourage researchers to publish their source code and datasets?

There can be many different answers to this question. I will present some tentative answers and I hope that readers will also contribute by adding other ideas in the comment section.

First, I think that a good solution would be for the main data mining journals and conferences to add special tracks for papers whose authors publish their source code and datasets. For example, some popular conferences like KDD already have a general track, an industry track and perhaps some other tracks. If a special track were added for authors who submit their source code and datasets, giving those papers a slightly higher chance of being accepted, I think that it would be a great incentive.

Second, one idea is to organize special workshops or implementation competitions where researchers have to share their code and datasets. For example, in the field of frequent pattern mining, there were two famous implementation competitions, FIMI2003 and FIMI2004 (http://fimi.ua.ac.be/), where about 20 algorithm implementations were released. In this case, the source code was not released for all algorithms, but at least the binaries were. After 2004, no implementation workshop was held on this topic, and therefore very few authors have released implementations of the newer algorithms, which is a pity. If there were more workshops like this, it would encourage researchers to share their code and datasets.

Third, one could imagine creating organized repositories or libraries where researchers could share their source code and datasets. Some already exist, but there are not many and they are not very popular.

Fourth, one could think of creating incentives for students and researchers at universities to release their data and code, or even requiring them to do so. For example, a department could require all of its students to publish their code and data. Another option would be for funding agencies to require that code and data be shared.

Those are all the ideas that I have for now. If you have other ideas, please share them in the comment section!


P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.

