In this post, I will discuss what are the steps that I follow to implement a data mining algorithm.
The subject of this post comes from a question that I have received by e-mail recently, and I think that it is interesting to share the answer. The question was :
When we do he programming I’ve observed a good programmer always makes a flow chart, data flow diagram, context diagram etc. to make the programming error free. (…) In your implementation of XXXX algorithm, have you performed any procedures as discussed above? (…) Can you explain me what are the basic procedures that you have applied to implement XXXX algorithm?
When I implement a data mining algorithm, I don’t do any flowchart or UML design because I consider that a single data mining algorithm is a small project. If I was programming larger software program, I would think about object-oriented design and UML. But for a single algorithm, it does not need to be an object oriented design. Actually, what is the most important for a data mining algorithm is that the algorithm produces the correct result, is fast and preferably that the code is well-documented and clean.
Step 1: understanding the algorithm. For implementing a data mining algorithm, the first step that I perform is to read the research paper describing it and make sure that I understand it well. Usually, I need to read the paper a few times to understand it. If there is an example in the paper I will try to understand it and later I will use this example to test the algorithm to see if my implementation is correct. After I read the paper a few times, I may still not understand some details about the algorithm. But this is ok. There is often some tricky details that you may only understand when doing the implementation because the authors do not always give all the details in the paper, sometimes due to the lack of space. Especially, it is common that authors do not describe all the optimizations and data structures that they have used in a research paper.
Step 2: implementing a first draft of the algorithm step by step. Then, I start to implement the algorithm. To implement the algorithm I print the pseudocode and i try to implement it. For an algorithm that takes a file as input, I will first work on the code for reading the input file. I will test this code extensively to make sure that I read the input file correctly. Then, I will add additional details to perform the first operations of the algorithm. I will check if the intermediary result is correct by comparing with the example in the paper. If it is not correct I will debug and maybe read the paper again to make sure that I did not make a mistake because I did not understand something in the paper. Then I will continue until the algorithm is fully implemented.
Step 3: testing with other input files. When my implementation becomes correct, I will try with a few more examples, to make sure that it is not correct for a single example but that it can provide the correct result for other input files.
Step 4: cleaning the code. After that, I will clean my code because the first draft is likely not pretty.
Step 5: optimizing the code. Then I will try to optimize the algorithm in terms of (1) using better data structures, (2) simplifying the code by removing unecessary operations, etc. For example, for my implementation of PrefixSpan in my SPMF data mining software, I first made a very simple implementation without an optimization called pseudo-projection that is described in the paper. It was very slow. After my implementation was correct, I took a few days to optimize it. For example, I added the pseudo-projection, I also added code for another optimization which is to remove infrequent items after the first input file scan, I removed some unnecessary code that I had left, I reorganized some code, I added some comments, etc.
Step 6: Comparison of the performance with other implementations of the same algorithm / peer review. After my code is optimized, as an optional sixth step, I may compare the performance of my implementation with other implementations of the same algorithm if some are available on the Internet in the same programming language.If my implementation is slower, I may look at the source code of the other implementation to see if there is some ideas that I have not thought of that could be used to further optimize my code. I may also ask some of my friends or colleagues to review my code. Another good way is to not show the code to your colleague but just to explain them the main idea to get their feedback. Discussing with other people is a good way to learn.
It takes times… Note that being good at programming data mining algorithms takes time. For example, let me tell you about my story. The first time that I implemented data mining algorithms was in december 2007. I implemented the Apriori algorithm for a course project at university. My implementation was terrible and slow… But it generated the correct result. I then implemented PrefixSpan in 2008. At that time, my code was better because I was gaining some experience on implementing this kind of algorithms. Then in 2010, I read my Apriori and PrefixSpan code again and I still found some problems and I optimized them again. What I want to say here is that it is normal that the first implementation(s) of data mining algorithms that one person makes may not be very good. But after implementing a few algorithms, it becomes much easier. Now, we are in 2013 and I have implemented more than 45 algorithms in my open-source SPMF Java data mining software!
This is what I wanted to write for today. Hope that you enjoyed this post. If you want to read more on this topic, you may be interested by my post on How to be a good data mining programmer. Lastly, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!