A common problem in research on data mining is that researchers proposing new data mining algorithms often do not compare the performance of their new algorithm with the current state-of-the art data mining algorithms.
For example, let me illustrate this problem. Recently, I have done a literature review on the topic of frequent subgraph mining. I have searched for all recent papers on this topic and also older papers. Then, based on this search, I have drawn the following diagram, where an arrow X –>Y from an algorithm X to an algorithm Y indicates that X was shown to be a better algorithm than Y in terms of execution time by the authors of X in an aexperiment.
What is the problem ? The problem is that several algorithms have not been compared with each other. Therefore if someone wants to know which algorithm is the best, it is impossible to know the answer without implementing the algorithms again and performing the comparison (because many of them have not disclosed their source code or binary files publicly). For example, as we can see in the above diagram,
- GSPAN (2002) was shown to be faster than FSG (2001), the initial algorithm for this problem.
- Then FFSM (2003) was shown to be faster than GSPAN (2002).
- Then, GASTON (2004) was shown to be faster than FSG (2001) and GSPAN (2002). But it was not compared with FFSM (2003). It is therefore still unknown if GASTON is faster than FFSM (2003).
- Thereafter, SLAGM (2007) was proposed and was not compared with any of the previous algorithms.
- Then, FSMA (2008) was proposed and was only compared with SLAGM (2007), which was has still not been compared with any of the previous algorithms.
- Then, FPGraphMiner (2011) was proposed. It was compared with GSPAN (2002). But it was not compared with Gaston (2004), which was shown to be faster than GSPAN (2002). Moreover, it was not compared with FFSM (2003)
Note that on the above diagram boxes in green represents algorithms for closed subgraph mining (ClosedGraphMiner (2003) and maximal subgraph mining (SPIN(2004) and MARGIN (2006)). For these algorithms:
- ClosedGraphMiner (2003) was compared with GSPAN (2002), which is fine since it was the best algorithm at that time.
- SPIN (2004) was compared with FFSM (2003) and GSPAN (2002), which is fine since they were the two best algorithms at that time.
- However, MARGIN (2006) was not compared with SPIN (2004). It was only compared with FFSM (2003). Moreover, it was not compared with GASTON (2004).
As you can see from this example, the topic of subgraph mining is a little bit of a “mess” if one wants to know which algorithm is the best. It could be tempting to use the above diagram to make transitive inference. For example, we could use the links FFSM <– MARGIN <– FPGraphMiner to infer that FPGraphMiner should be faster than FFSM). However, there is a risk to draw such inference, given that performance comparison can often vary depending on the choice of datasets and so on.
So what is the solution to this problem? Personally, I think that the solution is that researchers should make their source code or binary files public, as well as their datasets. This would facilitate the comparison of algorithm performance. However, it would not solve all problems since different researchers may use different programming languages for example. Another part of the solution would be that researchers would make the extra effort to implement more competitor algorithms when doing performance comparison, and that they should do their best to choose the most recent algorithms.
This is all of what I wanted to write for today! Hope that you have enjoyed this post. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.