In this post, I will discuss what it takes to be a **good data mining programmer** and how to become one.

**Data mining** is a broad field that can be approached from several angles. Some people with a mathematical background will take a statistical approach to data mining and use statistical tools to study data. Others will use ready-made commercial or open-source data mining software to analyze their data. In this post, we will discuss the computer science view of data mining. It is aimed at programmers who would like to become good at **implementing and designing data mining algorithms**.

There are some great benefits to being not just a user but a **data mining programmer**. First, you can implement algorithms that are not offered in existing **data mining tools**. This is important because several data mining tools are restricted to a small set of algorithms. For example, if you consider a data mining task such as clustering, hundreds of algorithms have been proposed to handle many different scenarios. However, **general-purpose data mining tools** often offer just a few of them. Second, you can download open-source algorithms and adapt them to your needs. Third, you could eventually design your own **data mining algorithms** and implement them efficiently.

So now that we have talked about the advantages, let’s talk about how to become a **good data mining programmer**. We can break this down into two aspects: being good at programming and knowledgeable about computer science in general, and being good at programming data mining algorithms in particular.

To be **good at programming**, you should have good knowledge of at least one programming language that you will use. The choice of language matters because performance is generally important in data mining. So you may go for a language like C++ that compiles to machine code, or a language like Java or C# that is reasonably fast and can be more convenient to use. You should avoid web languages such as PHP and JavaScript, which are less efficient, unless you have good reasons to use them.

After that, you should try to get a good knowledge of the data structures that are offered in your programming language. A good programmer should know when to use each data structure. This is important because you will eventually optimize your algorithms. In data mining, **optimizations** can make the difference between an algorithm that runs for hours and one that runs for just a few minutes, or between using gigabytes and megabytes of memory! So you should get to know the main data structures that are offered, such as array lists, linked lists, binary trees, hash tables, hash sets, bitsets and priority queues (heaps). But more importantly, you should know that there are many **data structures** that are not offered with your programming language. You should know how to look for other data structures in books or on websites.
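To make this concrete, here is a minimal sketch of how the choice of data structure changes the way a basic operation is computed. It counts the support of an itemset (the number of transactions that contain it) in two ways: by scanning a list of transactions, and by intersecting one bitset per item. The toy database and the function names are my own assumptions for illustration, and Python is used only to keep the example short; the same idea carries over to Java or C++ bitsets.

```python
# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_by_scan(itemset, transactions):
    """Count the transactions that contain every item of `itemset`
    by scanning the whole database."""
    return sum(1 for t in transactions if itemset <= t)


def build_bitsets(transactions):
    """Map each item to an integer used as a bitset:
    bit i is set if transaction i contains the item."""
    bitsets = {}
    for i, t in enumerate(transactions):
        for item in t:
            bitsets[item] = bitsets.get(item, 0) | (1 << i)
    return bitsets


def support_by_bitsets(itemset, bitsets, n_transactions):
    """Intersect the bitsets of the items with a bitwise AND,
    then count the bits that remain set."""
    result = (1 << n_transactions) - 1  # start with all transactions
    for item in itemset:
        result &= bitsets.get(item, 0)  # unknown items match nothing
    return bin(result).count("1")


bitsets = build_bitsets(transactions)
print(support_by_scan({"bread", "milk"}, transactions))
print(support_by_bitsets({"bread", "milk"}, bitsets, len(transactions)))
```

Both calls return the same count. The bitset version builds its index once and then answers each query with a few bitwise operations instead of a pass over the database, which is the kind of trade-off worth knowing about when you start optimizing.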

Besides, you should try to get better at algorithmics (designing efficient algorithms) and computer science in general. There are many different ways to do that, such as taking courses on the topic or reading books. But most importantly, you need to put the theory into practice and do some programming, which leads me to the key part of this post.

To become good at programming data mining algorithms, you need to write data mining algorithms. To get started, you should read some data mining books such as the book by Tan, Steinbach & Kumar, or the book by Han & Kamber. I recommend **starting by implementing some simple algorithms without optimizations**. For example, K-means and Apriori are relatively easy to implement. After you have debugged your implementation and checked that it generates the correct result, you should spend time thinking about how to optimize it. First, think about optimizations by yourself. Then look at how other people did it by reading websites and articles or by studying other people's code. Most likely, many optimizations have already been proposed. After that, you could implement the optimizations, and then move on to more complex algorithms. Finally, remember that Rome was not built in a day. Give yourself some time to learn!
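As an illustration of what a first, unoptimized implementation might look like, here is a minimal sketch of Apriori. It assumes transactions are given as sets of items and that the minimum support is an absolute count; the function name, the parameter names and the toy database are my own choices, not taken from any library or book. Python is used only for brevity. Every candidate is counted with a full database scan, which is exactly the kind of thing you would optimize later.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
db = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset (a frozenset)
    to its support count. Deliberately unoptimized."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Naive counting: one full scan of the database per candidate.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {}
    for item in items:
        candidate = frozenset([item])
        count = support(candidate)
        if count >= minsup:
            level[candidate] = count

    frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) != k:
                continue
            # Pruning: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(s) in level for s in combinations(union, k - 1)):
                candidates.add(union)

        # Keep only the candidates that reach the minimum support.
        level = {}
        for candidate in candidates:
            count = support(candidate)
            if count >= minsup:
                level[candidate] = count
        frequent.update(level)
        k += 1
    return frequent


result = apriori(db, minsup=3)
for itemset, count in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

Once a version like this gives the correct result on a small database that you can check by hand, it becomes a reference against which you can test each optimization, for example a faster way of counting support or a smarter way of generating candidates.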

I have obviously not mentioned everything. In particular, being good at mathematics is also important. If you have some additional thoughts, you can share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS feed or follow my Twitter account (https://twitter.com/philfv) to get notified about new blog posts.

it was hopeful but very abstract

Hello

I would be grateful if you could send me an email so that I can consult you about doing a PhD in data mining. I have spent 3 months so far not knowing what to do.

Thank you for these useful articles

Ahmed ALjuboori

Hello

Glad you like the articles. You may ask your question(s) in the data mining forum:

http://forum.ai-directory.com/list.php?5

Usually, I don’t have much time to help other people because I have to take care of my own master's degree students first and I’m very busy teaching. But if there are some easy questions to answer on the forum, I usually answer them. Other users of the forum may answer them too.

Please suggest a subtopic in data mining for my MTech thesis. I’m good at programming. I know C++ as well as Java.

Hi,

My answer to this question is here: How to choose a good thesis topic in Data Mining?

Hi,

I want to implement the Pincer-Search algorithm in Java. Please provide me with the source code.

Thanks

Priyanka

I don’t know about this algorithm. You may search for source code on Google.

Sir, your tweets have really helped me a lot to understand what data mining, data structures, pattern mining, etc. are. I am eagerly working on itemset mining, and very soon I will submit my article.

Thanking you, Sir,

I am very glad to say you are a package of information.

Dear Deepak,

Glad that everything has been helpful. And thanks for letting me know 😉

Wish you good luck for your first paper. It is great that you are preparing an article.

Philippe

Hi, please, how do I get the source code of the CLOSET algorithm?

Some possibilities: (1) contact the authors, (2) search for open-source implementations (I am not aware of any, but maybe you can find some), (3) implement it by yourself, (4) consider using another algorithm instead.

Hello Sir,

Thank you so much for the information… I found all your blog posts very informative… I have tried the Apriori algorithm in Java and it worked. Please suggest which algorithm I should implement now for HUIM.

Regards

Hello,

Great. What is your goal? Do you want to implement an algorithm to learn something and for fun? Or to write a research paper? If it is for writing a research paper, there are many possibilities. Generally, you can either make a faster algorithm, or combine two topics to make a new topic, and then write an algorithm and publish a paper. For example, recently, I combined periodic pattern mining + HUIM = periodic HUIM and I wrote a paper. This is just an example, but you can combine any two topics in pattern mining to generate new topics like that. The resulting topic needs to be interesting (not trivial) and raise some new challenges. If you want to do something for HUIM, you could start from the FHM algorithm, which is fast and still simple to understand. Then you could modify it to do something else.

Hope this gives you some ideas.

Philippe

Hello Sir

I have learnt a lot from your posts. From writing a good research paper to understanding specific technical challenges in data mining…

I would like to ask a question and would be very thankful if you could answer. How can we optimize a rule-based system? Just as you explained how to optimize a data mining algorithm, can you suggest something about optimizing rules for data mining?

Hi, glad you like the blog 😉 I will try to answer your question. In general, when we optimize something, we need to optimize with respect to some goal, so it depends on what your goal is. For example, if your goal is to discover rules and to find a small set of rules that gives good accuracy for prediction, then your goal is accuracy and you may apply some method to try to find an optimal or nearly optimal solution for that goal. But there could be other goals besides accuracy, and you could optimize with respect to those instead. Now, which method should you use to find an optimal set of rules? Well, you could use an exact algorithm. That means potentially testing all the possibilities to find the truly optimal solution. Or you could use algorithms that are designed to find approximate solutions to hard problems, such as genetic algorithms, particle swarm optimization and other evolutionary algorithms. These algorithms can find nearly optimal solutions to hard problems, but there is no guarantee that they will find the truly best solution. Hope this helps.

Thank you very much Sir for the detailed response, indeed helpful.