This weekend, I attended the International Conference on Image, Vision and Intelligent Systems, held from 18 to 20 June 2021 in Changsha, China.
It is a medium-sized conference (about 100 participants), but it was well organized, with many interesting activities and speakers, as well as some workshops. The main theme of the conference is image processing and computer vision, but some other works more related to intelligent systems were also presented.
I participated in this conference as an invited keynote speaker. I gave a talk on analyzing data for intelligent systems using pattern mining techniques. There was also an interesting keynote talk by Prof. Yang Xiao from the University of Alabama, USA, about detecting electricity theft in electricity networks and smart grids. Another keynote speaker was Prof. En Zhu from the National University of Defense Technology, who talked about detecting flow and anomalies in images. The fourth keynote speaker was Prof. Yong Wang from Central South University, who talked about optimization algorithms and edge computing. That presentation showed some cool applications, such as drones being used to improve internet coverage in some areas, or optimizing the placement of wind turbines in a wind farm. The last keynote speaker was Prof. Jian Yao from Wuhan University, who talked about image fusion. He showed many advanced techniques for transforming images, such as fixing lighting and stitching together overlapping videos.
Here are my pass and the program book:
Below is the registration desk. The staff was very helpful throughout the conference:
This is one of the rooms where the talks were held:
This is a group picture:
There were also social activities such as an evening dinner and banquet, where I met many interesting researchers that I will keep in contact with.
That is all I will write for today; it is just a quick overview of the conference. Next month, I will write about the ICSI 2021, CCF-AI 2021, DSIT 2021, and IEA AIE 2021 conferences, which I will also attend.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
In this blog post, I will talk about the most important algorithms for high utility itemset mining. The list is of course subjective (it reflects my opinion). I did not list all the papers of all authors, but I have listed the key papers that I have read and found interesting, as they introduced some innovative ideas. This list is partly based on my survey of high utility itemset mining. To help you while reading these papers, you may also check my glossary of high utility itemset mining, which explains the key terms used in that research field.
Author / Date
Paper title
Key idea
Yao 2004
A Unified Framework for Utility-based Measures for Mining Itemsets
– This paper defined the problem of high utility itemset mining.
Liu 2005
A two-phase algorithm for fast discovery of high utility itemsets
– The first complete algorithm for high utility itemset mining, named Two-Phase. – Two-Phase is an Apriori-based algorithm that performs two phases to find the high utility itemsets. – Introduced the TWU (transaction-weighted utilization) upper bound for reducing the search space (a small example of computing the utility and the TWU is given after the table).
Ahmed 2009
Efficient Tree Structures for High-utility Pattern Mining in Incremental Databases
– The first FP-Growth-based algorithm for high utility itemset mining, named IHUP. – Can discover high utility itemsets incrementally.
Wu 2011 / 2013
Efficient algorithms for mining high utility itemsets from transactional databases
– Improved FP-Growth algorithms for high utility itemset mining, called UP-Growth and UP-Growth+. – These algorithms adopt several strategies to reduce the TWU upper bound. – The paper was published in KDD and then in TKDE, which gave high visibility to high utility itemset mining.
Liu 2012
Mining high utility itemsets without candidate generation
– Presented HUI-Miner, the first algorithm for mining high utility itemsets in one phase. This revolutionized the field, as all previous algorithms used two phases. Speed-ups of 10 to 100 times can be obtained compared to the previous best algorithms. – Introduced the remaining utility upper bound, which is tighter than the TWU. – HUI-Miner is based on Eclat.
– Presented a fast vertical algorithm for high utility itemset mining, called FHM, which adopts a new technique called co-occurrence pruning. This can further speed up the task by 5 to 10 times. – FHM is based on HUI-Miner and was shown to outperform it.
– Another major performance improvement. – This paper presented EFIM, a novel algorithm that mines high utility itemsets in almost linear time and space. – Introduced several new ideas for high utility itemset mining, such as transaction merging and using arrays for utility and upper-bound counting. – The sub-tree utility upper bound can be tighter than the upper bounds of HUI-Miner and FHM. – This algorithm is inspired by HUI-Miner and LCM.
Duong 2017
Efficient High Utility Itemset Mining using Buffered Utility-Lists
– Proposed to reduce the memory usage of HUI-Miner and FHM-based algorithms using the concept of buffered utility-lists. – The modified algorithm is called ULB-Miner.
Qu 2020
Mining High Utility Itemsets Using Extended Chain Structure and Utility Machine
– Proposed the REX algorithm, a one-phase algorithm that adopts new strategies such as the k-item utility machine and a switch strategy.
Tseng 2013 /2015
Efficient Algorithms for Mining Top-K High Utility Itemsets
– Tseng proposed the task of top-k high utility itemset mining, where the user directly sets the number k of patterns to be found. – In this journal version of the paper, a fast one-phase algorithm called TKO is presented, which extends HUI-Miner and beats the TKU algorithm presented in the conference paper.
– To reduce redundancy, this paper proposed to discover a subset of all high utility itemsets called the generators of high utility itemsets. – An algorithm called GHUI-Miner is presented, based on FHM. – It can be argued that these itemsets are more useful in some cases than all high utility itemsets.
Wu 2019
Efficient algorithms for high utility itemset mining without candidate generation
– This paper presented an algorithm called CHUI-Miner for discovering the maximal high utility itemsets. – This algorithm is based on HUI-Miner. – Maximal itemsets are the largest ones. Discovering them can greatly reduce the number of patterns shown to the user.
– This paper makes the observation that finding very long patterns is unnecessary. – Thus, an optimized algorithm called FHM+ is presented to reduce the upper bounds and obtain better performance when searching for high utility itemsets using a length constraint. – FHM+ is based on FHM.
– This paper introduced the concept of periodic patterns in high utility itemset mining. – The goal is not only to find patterns that have a high utility but also patterns that appear periodically over time. For example, one may find that a customer periodically purchases beer and wine every week or so. – The PHM algorithm was presented, which is inspired by HUI-Miner and PFPM.
– This is the first paper on closed high utility itemset mining. – This paper introduced the CHUD algorithm, which is inspired by DCI_Closed. – Closed itemsets allow obtaining a small set of high utility itemsets that provides concise information about all high utility itemsets (a summary). – There have been several more efficient algorithms since then, such as EFIM-Closed and CLS-Miner, but CHUD was the first one.
– This paper introduced an FHM-based algorithm called MinFHM to find the high utility itemsets that are minimal (high utility itemsets that do not contain any smaller high utility itemset). – This can be useful for some applications.
Hong 2009
Mining High Average-Utility Itemsets
– This paper introduced the problem of high average utility itemset mining. – There have been many algorithms on this topic afterward. The utility is divided by the length of a pattern to avoid finding patterns that are too long. – The TPAU and PBAU algorithms were presented, which are inspired by Two-Phase, Apriori and Eclat.
Truong 2018
Efficient Vertical Mining of High Average-Utility Itemsets based on Novel Upper-Bounds
– This paper introduced the concept of vertical upper bounds in high average utility itemset mining. This has provided a major performance boost. – The dHAUIM algorithm was presented, and published in TKDE, a top data mining journal.
Yin 2012
USpan: an efficient algorithm for mining high utility sequential patterns
– This paper presented an algorithm named USpan for high utility sequential pattern mining, a related task that aims to find high utility patterns in sequences. – It is not the first algorithm for this problem, but it was published in KDD and is arguably the most popular. Thus, I have selected it.
Lin 2015
Mining high utility itemsets in big data
– The first algorithm for mining high utility itemsets using a big data framework (Hadoop).
– An algorithm named HUSRM to find high utility sequential rules. – This topic is similar to high utility sequential pattern mining, but the patterns found are rules that have a high confidence.
Cagliero 2017
Discovering high utility itemsets at multiple abstraction levels
– The first paper to use a taxonomy of items for multi-level high utility itemset mining. – The ML-HUI-Miner algorithm is an extension of HUI-Miner.
– This paper generalized the work of Cagliero (2017) to find cross-level high utility itemsets (itemsets containing items from any abstraction level of a taxonomy). – The proposed CLH-Miner algorithm extends FHM. A top-k version of CLH-Miner called TKC was also proposed in another paper.
Chu et al., 2009
An efficient algorithm for mining high utility itemsets with negative item values in large databases
– Most algorithms for high utility itemset mining assume that the utility must be a positive number (e.g., an amount of money). – This is the first paper to consider that the utility can also be negative (e.g., selling an item at a loss in a supermarket). – The HUI-NIV-Mine algorithm was designed for this task. It is a two-phase algorithm inspired by Two-Phase.
Goyal 2015
Efficient Skyline Itemsets Mining
– This paper presented the first algorithm that mines skyline high utility itemsets. – The idea is to find patterns that are not dominated by other patterns, by considering their support and utility to find a Pareto front. – Other more efficient algorithms were proposed later.
– This paper presents the current state-of-the-art algorithm for high utility quantitative itemset mining. – In this problem, patterns contain quantities. For example, a high utility itemset may say that a customer buys 2 to 5 loaves of bread with 1 or 2 bottles of milk. – The original problem was proposed by Yen (2007), but this is the newest algorithm, based on FHM.
Kannimuthu 2014
Discovery Of High Utility Itemsets Using Genetic Algorithm With Ranked Mutation
– This is one of the first heuristic algorithms for high utility itemset mining. – It utilizes a genetic algorithm to find an approximation of all high utility itemsets. – After that, many algorithms have used other heuristics in recent papers.
Wu 2013
Mining High Utility Episodes in Complex Event Sequences
– Proposed US-Span, the first algorithm for finding high utility episodes, that is, subsequences with a high utility in a sequence of events.
– Proposed to consider the time dimension to find peak high utility itemsets and local high utility itemsets, that is, itemsets that have a high utility in some non-predefined time intervals (e.g., some products may yield a high profit during Christmas time only). – The LHUI-Miner and PHUI-Miner algorithms are variations of the FHM algorithm.
– This paper aims to find correlated high utility itemsets, that is, itemsets that not only have a high utility (importance) but also contain items that are highly correlated. This is to avoid finding patterns that have a high utility but just appear together by chance. – Two measures from frequent itemset mining are incorporated into high utility itemset mining: the bond and the all-confidence. – The designed algorithm, FCHM-Miner, is an extension of the FHM algorithm.
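To make these notions more concrete, here is a small, hypothetical Java example (not taken from any of the above papers) showing how the utility of an itemset and its TWU upper bound can be computed on a toy transaction database. The items and utility values are invented for illustration; real algorithms such as Two-Phase or HUI-Miner compute these values much more efficiently using specialized data structures.

```java
import java.util.*;

public class UtilityExample {

    // A toy transaction database: each transaction maps an item to its utility
    // (e.g., quantity bought multiplied by the unit profit). Values are made up.
    static List<Map<String, Integer>> database = List.of(
            Map.of("bread", 2, "milk", 6, "beer", 10),
            Map.of("bread", 4, "wine", 12),
            Map.of("milk", 3, "beer", 5, "wine", 6),
            Map.of("bread", 1, "milk", 2)
    );

    // Utility of an itemset X: sum, over the transactions containing all items of X,
    // of the utilities of the items of X in those transactions.
    static int utility(Set<String> itemset) {
        int total = 0;
        for (Map<String, Integer> t : database) {
            if (t.keySet().containsAll(itemset)) {
                for (String item : itemset) total += t.get(item);
            }
        }
        return total;
    }

    // TWU of an itemset X: sum of the transaction utilities (the utility of ALL items
    // of a transaction) of the transactions containing X. It is an upper bound on the
    // utility of X and of any of its supersets.
    static int twu(Set<String> itemset) {
        int total = 0;
        for (Map<String, Integer> t : database) {
            if (t.keySet().containsAll(itemset)) {
                total += t.values().stream().mapToInt(Integer::intValue).sum();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Set<String> x = Set.of("bread", "milk");
        System.out.println("utility({bread, milk}) = " + utility(x)); // (2+6) + (1+2) = 11
        System.out.println("TWU({bread, milk})     = " + twu(x));     // 18 + 3 = 21
    }
}
```

The complete algorithms listed above avoid enumerating all itemsets explicitly: they prune any candidate whose TWU (or a tighter upper bound such as the remaining utility) is below the minimum utility threshold.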
In this blog post, I have listed some key papers about high utility itemset mining. As I said above, this list is based on my opinion. But I think it can be useful. Hope you have enjoyed this post.
In this blog post, I will list the key algorithms for periodic itemset mining (in my opinion), with comments about their contributions. Of course, this list is subjective. I did not list all the papers of all authors, but I have listed the key papers that I have read and found interesting, as they introduced some innovative ideas. I did not list papers that were mostly incremental in their contributions. I also did not list papers that had very few references unless I found them interesting.
Algorithm
Author / Date
Key idea
PF-Growth
Tanbeer 2009
– First algorithm for periodic itemset mining in a transaction database. – Uses the maxPer constraint to select periodic patterns. – Based on FP-Growth
MTKPP
Amphawan 2009
– First algorithm for top-k periodic itemset mining – Uses the maxPer constraint to select periodic patterns. – Based on Eclat
ITL-tree
Amphawan 2010
– Performs an approximate calculation of the periods of patterns – Based on FP-Growth
MIS-PF-tree
Kiran 2009
– Mining periodic patterns with a maxPer threshold for each item – Based on FP-Growth
Lahiri 2010
– Proposed to study periodic patterns as subgraphs in a sequence of graphs.
PFP
Rashid 2012
– Find periodic itemsets using the variance of periods. – The periodic patterns are called regular periodic patterns. – Based on FP-Growth
– Generalizes the problem of periodic itemset mining to provide more flexibility using three measures: the average periodicity, the minimum periodicity and the maximum periodicity (a small example of computing these measures is given after the table). – It is shown that the average periodicity is inversely related to the support measure. – Based on Eclat
PHM
Fournier-Viger 2015
– An extension of the PFPM algorithm to mine periodic high utility itemsets (itemsets that are periodic but also important, e.g., yield a high profit). – Based on PFPM, Eclat and FHM
– Find periodic patterns in multiple sequences. – Introduces a measure called the “sequence periodic ratio”. – Based on PFPM and Eclat
MRCPPS
Fournier-Viger 2019
– Find periodic patterns in multiple sequences that are rare and correlated. – Use the sequence periodic ratio, bond measure and maximum periodicity to select patterns – Based on PFPM and Eclat
PPFP
Nofong 2016
– Find periodic itemsets using the standard deviation of periods as a measure to select patterns. – Applies a statistical test to select periodic patterns that are significant. – Vertical algorithm based on Eclat and inspired by OPUS-Miner for the statistical test
PPFP+, PFP+…
Nofong 2018
– Find periodic itemsets using the standard deviation and variance of periods as measures to select patterns. – The measures are integrated into existing algorithms such as PPFP and PFP
PHUSPM
Dinh 2018
– Proposed to find periodic sequential patterns (subsequences that are periodic)
– Find stable periodic patterns using a novel measure called lability. – The goal is to find patterns that are generally stable rather than enforcing a very strict maxPer constraint as many algorithms do. – Based on FP-Growth
– Find locally periodic patterns (patterns that are periodic in some time intervals rather than in the whole database). That is, unlike most algorithms, it is not assumed that a pattern must always be periodic. – LPP-Growth is based on FP-Growth. – LPP-Miner is based on PFPM, which is inspired by Eclat and Apriori-TID
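To illustrate the periodicity measures used by the algorithms above (such as maxPer and the minimum, average and maximum periodicity), here is a small, hypothetical Java sketch, not taken from any of the above papers. It assumes that transactions are numbered from 1 to n and that the first and last periods are measured from the start of the database and up to its end (one common way of defining the periods).

```java
import java.util.*;

public class PeriodicityExample {

    // Computes the list of periods of a pattern, given the (sorted) ids of the
    // transactions where it appears and the total number of transactions n.
    // The first period is measured from the start of the database and the last
    // period up to the end of the database.
    static List<Integer> periods(List<Integer> tids, int n) {
        List<Integer> result = new ArrayList<>();
        int previous = 0;
        for (int tid : tids) {
            result.add(tid - previous);
            previous = tid;
        }
        result.add(n - previous);
        return result;
    }

    public static void main(String[] args) {
        int n = 10;                                  // a database of 10 transactions
        List<Integer> tids = List.of(2, 5, 6, 9);    // the pattern appears in t2, t5, t6, t9

        List<Integer> p = periods(tids, n);          // [2, 3, 1, 3, 1]
        int maxPer = Collections.max(p);
        int minPer = Collections.min(p);
        double avgPer = p.stream().mapToInt(Integer::intValue).sum() / (double) p.size();

        System.out.println("periods = " + p);
        System.out.println("maxPer = " + maxPer + ", minPer = " + minPer + ", avgPer = " + avgPer);
        // A pattern is then kept if, for example, its maxPer is no larger than
        // a user-defined threshold.
    }
}
```

Note that with this definition the average periodicity equals n divided by (support + 1), which is why it is inversely related to the support, as mentioned in the table above.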
Implementations
Several algorithms above are implemented in the SPMF data mining software in Java as open-source code.
Some survey papers
I have also recently written two book chapters that give an overview of some topics in periodic pattern mining. You may read them if you want a quick and easy-to-understand introduction to these topics.
Fournier-Viger, P., Chi, T. T., Wu, Y., Qu, J.-F., Lin, J. C.W., Li, Z. (2021). Finding Periodic Patterns in Multiple Discrete Sequences. In the book “Periodic Pattern Mining: Theory, Algorithms and Application”, Springer, to appear.
Conclusion
In this blog post, I have listed some key references in periodic pattern mining. Of course, I did not list all the references of all authors. I mainly listed the key papers that I have read and found interesting. This is obviously subjective.
On this blog, I have previously given an introduction to a popular data mining task called high utility itemset mining. Put simply, this task aims at finding all the sets of values (items) that have a high importance in a database, where the importance is evaluated using a numeric function. Does that sound complicated? It is not. A simple application is, for example, to analyze a database of customer transactions to find the sets of products that people buy together and that yield a lot of money (values = purchased products, utility = profit). Such high utility patterns can then be used to understand customer behavior and make business decisions. There are also many other applications.
High utility itemset mining is an interesting problem for computer science researchers because it is hard. There are often millions of ways of combining values (items) together in a database (with just 1,000 distinct items, there are 2^1000 - 1 possible itemsets). Thus, an efficient algorithm for high utility itemset mining must search for the solution (the set of all high utility itemsets) while ideally avoiding exploring all the possibilities.
To efficiently find a solution to a high utility itemset mining problem (task), several efficient algorithms have been designed, such as UP-Growth, FHM, HUI-Miner, EFIM, and ULB-Miner. These algorithms are complete algorithms because they guarantee finding the solution (all high utility itemsets). However, these algorithms can still have very long execution times on some databases, depending on the size of the data, the algorithm's parameters, and the characteristics of the data.
For this reason, a research direction in recent years has been to also design some approximate algorithms for high utility itemset mining. These algorithms do not guarantee to find the complete solution but try to be faster. Thus, there is a trade-off between speed and completeness of the results. Most approximate algorithms for high utility itemset mining are based on optimization techniques such as particle swarm optimization, genetic algorithms, the bat algorithm, and bee swarm optimization.
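To give an idea of how such heuristic algorithms work, here is a small, hypothetical simulated annealing sketch in Java. It is not the HUIM-SA algorithm from our paper, just a generic illustration under simple assumptions: candidate itemsets are encoded as bit vectors over the items, a neighbor is obtained by flipping the membership of one item, and the utility of the candidate itemset is used as the fitness. The database, items, and annealing parameters are made up.

```java
import java.util.*;

public class SimulatedAnnealingHUIMSketch {

    // Toy transaction database: each transaction maps an item to its utility.
    static List<Map<String, Integer>> database = List.of(
            Map.of("a", 5, "b", 2, "c", 1),
            Map.of("a", 4, "c", 3, "d", 6),
            Map.of("b", 2, "d", 8),
            Map.of("a", 5, "b", 1, "d", 2)
    );
    static List<String> items = List.of("a", "b", "c", "d");

    // Utility of the itemset encoded by the bit vector (0 for the empty itemset).
    static int utility(boolean[] selected) {
        Set<String> itemset = new HashSet<>();
        for (int i = 0; i < items.size(); i++) if (selected[i]) itemset.add(items.get(i));
        if (itemset.isEmpty()) return 0;
        int total = 0;
        for (Map<String, Integer> t : database) {
            if (t.keySet().containsAll(itemset)) {
                for (String item : itemset) total += t.get(item);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        boolean[] current = new boolean[items.size()];   // start from the empty itemset
        boolean[] best = current.clone();
        double temperature = 100.0;                       // made-up annealing schedule

        for (int iteration = 0; iteration < 1000; iteration++) {
            // Neighbor: flip the membership of one randomly chosen item.
            boolean[] neighbor = current.clone();
            int pos = random.nextInt(items.size());
            neighbor[pos] = !neighbor[pos];

            int delta = utility(neighbor) - utility(current);
            // Always accept improvements; accept worse solutions with a probability
            // that decreases as the temperature goes down.
            if (delta >= 0 || random.nextDouble() < Math.exp(delta / temperature)) {
                current = neighbor;
            }
            if (utility(current) > utility(best)) best = current.clone();
            temperature *= 0.99;                           // cool down
        }

        System.out.println("Best itemset found has utility " + utility(best));
    }
}
```

A real algorithm would typically restrict the encoding to promising items (for example, using the TWU), tune the cooling schedule, and collect every candidate whose utility reaches the minimum utility threshold instead of keeping a single best solution.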
Recently, my team wrote a new paper in that direction, to appear in 2021, in which we designed two new approximate algorithms named HUIM-HC and HUIM-SA, based respectively on hill climbing and simulated annealing. The PDF of the paper is below:
In that paper, we compared with many state-of-the-art approximate algorithms for this problem (HUIF-GA, HUIM-BPSO, HUIM-BA, HUIF-PSO, HUIM-BPSOS and HUIM-GA) and observed that HUIM-HC outperforms all of these algorithms on the tested datasets. For example, see some pictures from the runtime experiments below on 6 datasets:
In these pictures, it can be observed that HUIM-SA and HUIM-HC have excellent performance. In (a), (b), (c), (d), (e) and (f), HUIM-HC is the fastest, while HUIM-SA is the second best on most datasets (except Foodmart).
In another experiment in the paper, it is shown that although HUIM-SA is usually much faster than previous algorithms, it finds about the same number of high utility itemsets, while HUIM-HC usually finds a bit fewer.
If you are interested in this research area, a good starting point to save time is to read the above paper. You can also find the source code of all the above algorithms and the datasets in the SPMF data mining library. By using that source code, you do not need to implement these algorithms again and can compare with them directly. By the way, the source code of HUIM-HC and HUIM-SA will be included in SPMF next week (as I still need to finish the integration).
Hope that this blog post has been interesting! I have not written much on the blog recently because I have been very busy and some unexpected events occurred. But now I have more free time and I will start writing more on the blog again. If you have any comments or questions, please write a comment below.
—
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
Hi all, this is to let you know that the UDML workshop on utility-driven mining and learning is back again this year at ICDM 2021, for its fourth edition.
The topic of this workshop is the concept of utility in data mining and machine learning. This includes various topics such as:
Utility pattern mining
Game-theoretic multiagent systems
Utility-based decision-making, planning and negotiation
Models for utility optimization and maximization
All accepted papers will be included in the IEEE ICDM 2021 Workshop proceedings, which are EI indexed. The deadline for submitting papers is the 3rd of September 2021.
I am glad to announce that I am co-organizing a new workshop called MLiSE 2021 (1st International Workshop on Machine Learning in Software Engineering), held in conjunction with the ECML PKDD 2021 conference.
Briefly, the aim of this workshop is to bring together the data mining and machine learning (ML) community with the software engineering (SE) community. On one hand, there is an increasing demand and interest in SE to improve quality, reliability, cost-effectiveness and the ability to solve complex problems, which has led researchers to explore the potential and applicability of ML in SE. For example, some emerging applications of ML for SE are source code generation from requirements, automatically proving the correctness of software specifications, providing intelligent assistance to developers, and automating the software development process with planning and resource management. On the other hand, SE techniques and methodologies can be used to improve the ML process (SE for ML).
The deadline for submitting papers is the 23rd of June 2021, and the format is 15 pages in the Springer LNCS format.
All papers related to data mining, machine learning and software engineering are welcome. These papers can be theoretical or applied, and from academia or industry. If you are interested in submitting but are not sure whether your paper is relevant, feel free to send me an e-mail.
The papers will be published on the MLiSE 2021 website. Moreover, a Springer book and special journal issue are being planned (to be confirmed).
Hope that this is interesting and that I will see your paper submissions to MLiSE 2021 soon :-)
In this blog post, I will share the video of our most recent data mining paper, presented last week at ACIIDS 2021. It is about a new algorithm named POERM for analyzing sequences of events or symbols. The algorithm finds rules called “episode rules” indicating strong relationships between events or symbols. This can be used to understand the data or to do prediction. Some applications are, for example, to analyze the sequence of events in a computer network or the purchase behavior of customers in a store. This paper received the best paper award at ACIIDS 2021!
In this blog post, I will give a brief report about the ACIIDS 2021 conference, which I am attending from April 7 to 10, 2021.
What is ACIIDS?
ACIIDS is an international conference focusing on intelligent information and database systems. The conference is always held in Asia. In the past, it was organized in different countries such as Thailand, Vietnam, Malaysia, Indonesia and Japan. This year, the conference was supposed to be in Phuket, Thailand. However, due to the coronavirus pandemic, it was held online using the Zoom platform. It is the first time that I have attended this conference. This year, the timing was good, so I decided to submit two papers, which were accepted.
Here is the list of countries where ACIIDS was held in previous years:
Conference program of ACIIDS 2021
The conference received 291 papers, from which 69 were selected for oral presentation and published in the main proceedings, and about 33 more papers were published in a second volume. This means an acceptance rate of about 23% for the main proceedings, which is somewhat competitive. The papers cover various topics such as data mining techniques and applications, cybersecurity, natural language processing, decision support systems, and computer vision. ACIIDS is now in the CORE B ranking.
The main proceedings of ACIIDS 2021 are published by Springer in the Lecture Notes in Artificial Intelligence series, which ensures good visibility and indexing. The second proceedings book was published in another Springer book series.
There were four keynote speakers, who gave some interesting talks:
Opening ceremony
The conference started in the afternoon in Asia with the opening ceremony.
First day paper presentations
During the first day, I listened to several presentations. My team presented two papers:
During that session, there were not so many people (likely due to the different time zones), but I had some good discussions with other participants. In the first paper (video here), we presented a new algorithm for discovering episode rules in a long sequence of events. In the second paper, we investigated the importance of crossover operators in genetic algorithms for a data mining task called high utility itemset mining.
ACIIDS, a well organized virtual conference
The organization of the ACIIDS conference was very well done. Over the last year, I have attended several virtual conferences such as ICDM, PKDD, PAKDD and AIOPS, to name just a few. In general, I think that virtual conferences are not as enjoyable as “real” conferences (see my post: real vs virtual conferences), because it is harder to have discussions with other attendees and many attendees will not attend much of the conference.
Having said that, I think that the ACIIDS organizers did a good job of trying to make it an interesting virtual event and increasing the participation of attendees in the activities. What did they do? First, before the conference, they sent an e-mail to all attendees to collect pictures of us giving our greetings to the ACIIDS conference, and then made a video out of it. Second, the organizers created a contest where we could submit a picture of an intriguing point of interest in our city, and there was a vote for the best one during the conference. Third, there were several interesting activities such as a live ukulele show. Fourth, the organizers gave several awards to papers, some more or less serious, including an award called the “most intelligent paper”. Fifth, to increase participation, an e-mail was sent every day to attendees to remind them about the schedule.
Here are a few pictures from some of the virtual social activities:
The Ukulele show at ACIIDS 2021
Greetings from all around the world at ACIIDS 2021
Here are some pictures of some of the “Top 3” awards given to authors (some are serious, and some are not serious and just based on a statistical analysis):
The paper from my student received the “most internationalized paper” award, as we have authors from three continents:
Last day: more paper presentations, awards and closing ceremony
On the last day of the conference, there were more paper presentations, which were followed by the best paper award ceremony and the closing ceremony. It was announced that my student's paper received the best paper award:
Next year: ACIIDS 2022
It was announced that ACIIDS 2022 would be organized next year in Almaty, Kazakhstan around June. Almaty is the biggest city in Kazakhstan, so that should be interesting.
Registration fees
The registration fees for ACIIDS this year were lower than usual, perhaps because the conference was held online. This made the conference more attractive and affordable. Here is a comparison with previous years:
That is all for this blog post about ACIIDS 2021. Overall, it was an enjoyable conference. I did not attend all the sessions as I was quite busy this week, but what I saw was good. Thus, I am looking forward to ACIIDS 2022.
In this blog post, I will talk about a great resource that can help improve your academic writing. It is a website called the Academic Phrasebank, created by Dr. John Morley from the University of Manchester. This website, also published as a book, contains lists of sentences that are commonly used in research papers, each categorized according to its function and the section of the paper where it appears. This bank of common sentences is very useful for authors who don't know how to express their ideas, or who would like to get some inspiration or find different ways of writing their ideas.
Below is a screenshot of the main menu of the website:
There are six categories corresponding to the different sections of a typical research paper. Consider the first category, which is called “Introducing Work”. Let's say that we click on it to find common sentences to be used in the introduction of a paper. Then, some sub-categories are shown, such as:
Establishing the importance of the topic for the world or society
Establishing the importance of the topic for the discipline
Establishing the importance of the topic as a problem to be addressed
Explaining the inadequacies of previous studies
Explaining the significance of the current study
Describing the research design and the methods used
…
Then, let's say that we choose the first sub-category. This will show us a list of common sentences for establishing the importance of a research topic. Here are a few of those sentences:
X is fundamental to …
X has a pivotal role in …
X is frequently prescribed for …
X is fast becoming a key instrument in …
X plays a vital role in the metabolism of …
X plays a critical role in the maintenance of …
Xs have emerged as powerful platforms for …
X is essential for a wide range of technologies.
X can play an important role in addressing the issue of …
Xs are the most potent anti-inflammatory agents known.
There is evidence that X plays a crucial role in regulating …
… and many others…
I will not show more examples, but you can try the website to have a look at the other categories.
My opinion: I think that this website is quite rich and useful. I write many research papers and tend to always use more or less the same sentence structures. But looking at this phrasebank gives me some ideas about other types of sentences that I could try using as well. I think that this can help improve my writing style a bit in future papers!
That is all for today. I just wanted to share this useful resource for academic writing!
Recently, there has been some debate on the Machine Learning sub-Reddit about the reproducibility, or should I say the lack of reproducibility, of numerous machine learning papers. Several Reddit users complained that they spent much time (sometimes weeks) trying to replicate the results of some recent research papers but simply failed to obtain the same results.
To address this issue, a Reddit user launched a website called Papers Without Code to list papers whose results are found to be non-reproducible. The goal of this website is apparently to save time by indicating which papers cannot be reproduced, and perhaps even to put pressure on authors to write reproducible papers in the future. On the website, someone can submit a report about a paper that is not reproducible by indicating information such as how much time was spent trying, the title of the paper and its authors. Then, the owner of the website first sends an e-mail to the first author of the paper to give them a chance to provide an explanation before the paper is added to the list of non-reproducible papers.
The list of papers without code can be found here. At the time of writing this blog post, there are only four papers on the list. For some of these papers, some people mentioned that they even tried to contact the authors but got some dodgy responses, and that some authors promised to add the code to GitHub with a “coming soon” notice before eventually deleting the repository.
Personally, I am not sure that creating such a website is a good idea, because some papers may be added to the list undeservedly in some cases, which could harm the reputation of some researchers. But at the same time, there are many papers that simply cannot be reproduced, and many people may waste time trying to reproduce them. The owner of Papers Without Code has indicated on Reddit that s/he will at least take some steps to prevent problems, such as verifying the information about the sender, and informing the first author of a paper and giving them at least one week to answer before adding it to the list.
On Reddit, one comment was that it would be “easier to compile a list of reproducible” papers. In fact, there is a website called Papers with Code that does that for machine learning, although it is not exhaustive. Some people claimed on Reddit that, based on their experience, at least 50% to 90% of papers are not reproducible. I don't know if that is true, but there are certainly many. An undergraduate student on Reddit also said that he does not understand why providing code is not a requirement when submitting a paper. This is a good point, as it is not complicated to create an online repository and upload code…
Why do I talk about this? While I am not convinced that this website is a good idea, I think that it raises an important debate about the reproducibility of research. In some other fields, such as cancer research, it has been pointed out that several studies are difficult to reproduce. However, in computer science, this should not be so much of a problem, as code can easily be shared. Unless there are confidentiality or commercial restrictions on a research project, it should be possible for many researchers to publish their code and data at the time of submitting their papers. Thus, a solution could be for conferences and journals to make this requirement stricter at submission time.
Personally, I release the source code and data of almost all the papers where I am the corresponding author. I put the code and data in my open-source SPMF software, unless it is related to a commercial project and I cannot share the code. This has many advantages: (1) other people can use my algorithms and compare with them without having to spend time re-implementing the same algorithms; (2) people can use my algorithms for some real applications, which is useful to them; (3) this increases the number of citations of my papers; and (4) it convinces reviewers that the results in my papers can be reproduced.
Another reason why I share the code of my research is that, as a professor, much of my research is funded by the government through the university or grants. Thus, I feel that it is my duty to share what I do as open-source code (when possible).
What do you think? I would like to read your opinion in the comment section below.