Why is it important for researchers to publish source code and datasets?

Today, I will discuss why it is important that researchers share their source code and data.

As some of you know, I’m working on the design of data mining algorithms. More specifically, I’m working on algorithms for discovering patterns in databases. It is a problem that dates back to the 1990s, and hundreds of papers have been published on this topic. However, when searching on the Web, I found that very little source code, or even binary files, is available. On some specialized topics like uncertain itemset mining, for example, about 20 algorithms have been published, but only about two papers provide the source code and datasets.

This is a serious problem for research.

First, some of these algorithms are hard to implement. For people who are not familiar with the subject or who are average programmers, implementing the algorithms again is a huge waste of time, and this could deter them from using the algorithms. As some people say: why reinvent the wheel?

Second, the algorithm descriptions provided in research papers are often incomplete. Some researchers omit optimization details for lack of space. Others intentionally leave out details so that other people cannot implement their algorithm properly and beat its performance.

Third, let’s say that someone develops a new algorithm and wants to compare its performance with an already published algorithm. If this person cannot find the source code or binary files of the published algorithm, they have to implement it themselves. However, this version will differ from the original, and depending on how it is implemented, the comparison could be unfair.

Now, let’s talk about the advantages of sharing your source code and data.

First, as a researcher, if you publish your source code, it is much more likely that someone will use your algorithm or application. If someone uses your algorithm or application, they will cite you, which benefits you.

Second, other researchers can save time if they don’t have to reimplement the same algorithms. They can use this time to do more research, which benefits the whole research community.

Third, if you are the author of an algorithm, other people can compare their work against your own implementation of it. By sharing your source code, you can therefore be more confident that the comparison will be fair.

Fourth, other people are more likely to integrate your algorithm or software into other software, or to modify it to develop new algorithms or software. Again, this will benefit you because these people will cite you. The more people cite you, the more people will read your papers and cite you in turn.

(Update in 2018) Now, to conclude, I will talk about the benefits that I have received from sharing my work as open-source software over the last few years. I’m the author of the SPMF data mining software. This software offers more than 100 algorithms, most of them implemented by me, including a dozen that are my own algorithms. In about eight years, the website has received more than 500,000 visitors, and the software has been cited in more than 500 research papers and journal articles. Some people have applied the algorithms in biology, website clickstream analysis, and even chemistry. This has also greatly contributed to increasing the citations of my research papers.

I hope that this blog post will convince you that it is important to share the source code and the data of your work with other researchers.

  1. I agree. Most of the time, I cannot get the source code I want from research papers. I have contacted many authors to ask for source code, and they either don’t answer or don’t want to share it with me.

  3. I agree about the part where algorithm implementations are not made available by researchers. I have recently been looking for a parallel, scalable implementation of maximal clique detection in Hadoop. So far I’ve found some research papers claiming to have a working solution, but they give no references to the source code. Can you help?

    • Yes, I think most data mining papers do not share the code and datasets. At least, the papers that I read rarely provide the code. For the maximum clique problem, I do not have the source code. I think that the best solution would be to contact the authors of the papers that you have found and ask if they can provide it, or at least give you the binaries. You can always say that you will cite their paper, which might help convince them to share the code. But some researchers just don’t want to share their code. Some of them are afraid that if they share it, you will beat their algorithms and write another paper…

