In this post, I will discuss what it takes to be a good data mining programmer and how to become one.
Data mining is a broad field that can be approached from several angles. Some people with a mathematical background take a statistical approach to data mining and use statistical tools to study data. Others use ready-made commercial or open-source data mining software to analyze their data. In this post, we will discuss the computer science view of data mining. It is aimed at programmers who would like to become good at designing and implementing data mining algorithms.
There are some great benefits to being not just a user, but a data mining programmer. First, you can implement algorithms that are not offered in existing data mining tools. This is important because several data mining tools are restricted to a small set of algorithms. For example, if you consider a data mining task such as clustering, hundreds of algorithms have been proposed to handle many different scenarios. However, general-purpose data mining tools often offer just a few of them. Second, you can download open-source algorithms and adapt them to your needs. Third, you could eventually design your own data mining algorithms and implement them efficiently.
So now that we have talked about the advantages, let’s talk about how to become a good data mining programmer. We can break this down into two aspects: (1) being good at programming and knowledgeable about computer science in general, and (2) being good at programming data mining algorithms.
After that, you should try to get a good knowledge of the data structures that are offered in your programming language. A good programmer should know when to use each data structure. This is important because you will eventually optimize your algorithms. In data mining, optimizations can make the difference between an algorithm that runs for hours and one that runs for a few minutes, or between one that uses gigabytes of memory and one that uses megabytes! So you should get to know the main data structures that are offered, such as array lists, linked lists, binary trees, hash tables, hash sets, bitsets and priority queues (heaps). But more importantly, you should know that there are many data structures that are not offered with your programming language. You should know how to look up other data structures in books or on websites.
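To illustrate how the choice of data structure changes the work an algorithm does, here is a minimal sketch in Python. It counts how many transactions contain a given set of items in two ways: by scanning every transaction, and by intersecting bitsets. The toy database and the function names are made up for this example, and a plain Python integer stands in for a bitset. It is only one possible illustration, not a benchmark.

```python
# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]

def support_by_scan(itemset, transactions):
    """Count the transactions that contain every item, scanning the whole database."""
    return sum(1 for t in transactions if itemset <= t)

def build_bitsets(transactions):
    """Map each item to an int used as a bitset: bit i is set if transaction i has the item."""
    bitsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            bitsets[item] = bitsets.get(item, 0) | (1 << tid)
    return bitsets

def support_by_bitset(itemset, bitsets, n_transactions):
    """Intersect the bitsets of the items with bitwise AND, then count the set bits."""
    result = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        result &= bitsets.get(item, 0)  # an unknown item appears in no transaction
    return bin(result).count("1")

bitsets = build_bitsets(transactions)
query = {"bread", "butter"}
scan_count = support_by_scan(query, transactions)
bitset_count = support_by_bitset(query, bitsets, len(transactions))
```

Both functions return the same count. The bitset version pays a one-time cost to build the index, after which each query is a few bitwise operations instead of a pass over the database. Whether that trade-off pays off depends on the data and on how many queries you run.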
In addition, you should try to get better at algorithmics (designing efficient algorithms) and at computer science in general. There are many different ways to do that, such as taking courses on the topic or reading some books. But most importantly, you need to put the theory into practice and do some programming, which leads me to the key part of this post.
To become good at programming data mining algorithms, you need to write data mining algorithms. To get started, you should read some data mining books such as the book by Tan, Steinbach & Kumar, or the book by Han & Kamber. I recommend starting by implementing some simple algorithms without optimizations. For example, K-means and Apriori are relatively easy to implement. After you have debugged your implementation and checked that it generates the correct result, you should spend time thinking about how to optimize it. First, think about optimizations by yourself. Then look at how other people did it, by reading websites and articles or by studying other people's code. Most likely, many optimizations have already been proposed. After that, you could implement the optimizations, and then move on to more complex algorithms. Finally, remember that Rome was not built in a day. Give yourself some time to learn!
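As an idea of what such a first, unoptimized version could look like, here is a minimal sketch of Apriori in Python. It is not taken from any of the books above. The function name, the toy database and the choice of an absolute support count as the threshold are assumptions made for this example. It deliberately rescans the whole database for every candidate, which is exactly the kind of thing you would later try to optimize.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    min_support is an absolute number of transactions (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Unoptimized on purpose: one full database scan per candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy database, used only to check the result by hand.
toy_db = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]
result = apriori(toy_db, min_support=2)
```

On a database this small you can verify the output by hand, which is a good habit before you start optimizing. Here the frequent itemsets are the three single items plus {bread, milk} and {bread, butter}.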
I have obviously not mentioned everything. In particular, being good at mathematics is also important. If you have some additional thoughts, you can share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS feed or follow my Twitter account (https://twitter.com/philfv) to get notified about new blog posts.