Papers without code (and the problem of non-reproducible research)

Recently, there has been some debate on the Machine Learning subreddit about the reproducibility, or I should say the lack of reproducibility, of numerous machine learning papers. Several Reddit users complained that they spent a lot of time (sometimes weeks) trying to replicate the results of some recent research papers but simply failed to obtain the same results.

To address this issue, a Reddit user launched a website called Papers Without Code to list papers whose results were found to be non-reproducible. The goal of this website is apparently to save people time by indicating which papers cannot be reproduced, and perhaps even to put pressure on authors to write reproducible papers in the future. On the website, anyone can submit a report about a paper that is not reproducible by providing information such as the title of the paper, its authors, and how much time was spent trying to reproduce it. The owner of the website then sends an e-mail to the first author of the paper to give them a chance to provide an explanation before the paper is added to the list of non-reproducible papers.

The list of papers can be found on the Papers Without Code website. At the time of writing this blog post, there are only four papers on the list. For some of the papers on that list, people mentioned that they had even tried to contact the authors but received dodgy responses, and that some authors promised to add the code to GitHub with a “coming soon” notice before eventually deleting the repository.

Personally, I am not sure that creating such a website is a good idea, because some papers may be added to this list undeservedly, which could harm the reputation of some researchers. But at the same time, there are many papers that simply cannot be reproduced, and many people may waste time trying to reproduce them. The owner of Papers Without Code has indicated on Reddit that s/he will at least take some steps to prevent problems, such as verifying the information about the sender, informing the first author of a paper, and giving them at least one week to answer before adding the paper to the list.

On Reddit, one commenter remarked that it is “easier to compile a list of reproducible” papers. In fact, there is a website called Papers with Code that does that for machine learning, although it is not exhaustive. Some people claimed on Reddit that, based on their experience, 50% to 90% of papers are not reproducible. I don’t know if that is true, but there are certainly many. An undergraduate student on Reddit also said that he does not understand why providing code is not a requirement when submitting a paper. This is a good point, as it is not complicated to create an online repository and upload code…

Why do I talk about this? While I am not convinced that this website is a good idea, I think that it raises an important debate about the reproducibility of research. In some other fields, such as cancer research, it has been pointed out that several studies are difficult to reproduce. However, in computer science, this should not be so much of a problem, as code can be easily shared. Unless there are confidentiality or commercial restrictions on a research project, it should be possible for many researchers to publish their code and data when submitting their papers. Thus, a solution could be for conferences and journals to make this a stricter requirement at the time of submission.

Personally, I release the source code and data of almost all the papers where I am the corresponding author. I put the code and data in my open-source SPMF software, unless the work is related to a commercial project and I cannot share the code. This has many advantages: (1) other people can use my algorithms and compare with them without having to spend time re-implementing them, (2) people can use my algorithms in real applications, which is useful to them, (3) it increases the number of citations of my papers, and (4) it convinces reviewers that the results in my papers can be reproduced.

Another reason why I share the code of my research is that as a professor, much of my research is funded by the government through the university or grants. Thus, I feel that it is my duty to share what I do as open-source code (when possible).

What do you think? I would like to read your opinion in the comment section below.

Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.
