How to auto-adjust the minimum support threshold according to the data size

Today, I will do a quick post on how to automatically adjust the minimum support threshold of frequent pattern mining algorithms such as Apriori, FPGrowth and PrefixSpan according to the size of the data.

The problem is simple. Let’s consider the Apriori algorithm. The input of this algorithm is a transaction database and a parameter called minsup that is a value in [0,1]. The output of Apriori is a set of frequent patterns. A frequent pattern is a pattern whose support is greater than or equal to minsup. The support of a pattern (also called its “frequency”) is the number of transactions that contain the pattern divided by the total number of transactions in the database. A key problem for algorithms like Apriori is how to choose a minsup value that yields interesting patterns. There is no really easy way to determine the best minsup threshold. Usually, it is done by trial and error. Now let’s say that you have determined the best minsup value for your application.
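To make the definition of support concrete, here is a minimal Python sketch. The function name, the toy transactions and the minsup value of 0.5 are only illustrative assumptions, not something taken from Apriori or SPMF.

```python
def support(pattern, transactions):
    """Fraction of transactions that contain every item of the pattern."""
    pattern = set(pattern)
    if not transactions:
        return 0.0
    containing = sum(1 for t in transactions if pattern <= set(t))
    return containing / len(transactions)


# Toy transaction database (illustrative data only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

minsup = 0.5  # assumed threshold for this example
sup = support({"bread", "milk"}, transactions)  # contained in 2 of the 4 transactions
is_frequent = sup >= minsup
print(sup, is_frequent)
```

Here the pattern {bread, milk} appears in 2 of the 4 transactions, so its support is 0.5 and it is frequent for minsup = 0.5.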

Now, consider that the size of your data can change over time. In this case, how can you dynamically adjust the minimum support so that the threshold is higher when you don’t have a lot of data, and lower when you have more data?

The solution

A simple solution that I have found is to use a mathematical function for adjusting the minsup threshold automatically according to the database size (the number of transactions for Apriori).  This solution is shown below.

I chose to use the formula minsup(x) = (e^(-ax-b))+c, where x is the number of transactions in the database and a, b, c are positive constants. This makes it possible to set minsup to a high value when there is not a lot of data and to decrease minsup when there is more data. For example, on the first chart below, the minsup value is set to 0.7 if there is 1 transaction, it becomes 0.4 when there are 3 transactions, and it then decreases to 0.2 when there are around 9 transactions. This allows the minsup threshold to become stricter when there is less data. Note that the constants a, b and c can be adjusted to make the curve behave differently.
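The formula can be sketched in a few lines of Python. The constants A, B and C below are assumptions that I picked so that the curve roughly passes through the example values mentioned above; they are not necessarily the constants used to draw the charts.

```python
import math


def minsup(x, a, b, c):
    """Exponential-decay threshold: e^(-a*x - b) + c."""
    return math.exp(-a * x - b) + c


# Illustrative constants (assumed, chosen to roughly match the example values)
A, B, C = 0.458, 0.235, 0.2

for n in (1, 3, 9, 100):
    print(n, round(minsup(n, A, B, C), 3))
```

With these constants the threshold is about 0.7 for 1 transaction, about 0.4 for 3 transactions, and close to 0.2 for 9 transactions or more.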

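[Charts: the minsup threshold as a function of the number of transactions, and the corresponding number of transactions required for a pattern to be frequent]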

On the second chart above, I show the minsup threshold expressed as a number of transactions (an absolute support count). It is simply the minsup threshold multiplied by the database size. It shows the number of transactions in which a pattern needs to appear to become frequent, according to the database size.
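As a rough sketch, this number of transactions can be computed as follows. Rounding up with ceil is my assumption here: since a pattern is frequent when its support is at least minsup, it must appear in at least minsup(n) × n transactions, and a transaction count is an integer. The constants are again illustrative.

```python
import math


def minsup(x, a, b, c):
    """Exponential-decay threshold: e^(-a*x - b) + c."""
    return math.exp(-a * x - b) + c


def min_transaction_count(n, a, b, c):
    """Smallest number of transactions a pattern must appear in to be frequent."""
    return math.ceil(minsup(n, a, b, c) * n)


# Illustrative constants (assumed, not taken from the charts)
A, B, C = 0.458, 0.235, 0.2

for n in (1, 3, 9, 20):
    print(n, min_transaction_count(n, A, B, C))
```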

What is the meaning of the constants a, b, and c?

The constant c is the lower bound of the function: as x increases, minsup(x) approaches c but never falls below it. For example, if c = 0.2, the function will not generate minsup values that are less than 0.2. The constant a controls how quickly the curve decreases when x increases, while b scales the starting height of the curve (at x = 0, minsup is e^(-b) + c).
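If you already know the floor c and two minsup values that you would like the curve to pass through, the constants a and b can be obtained by taking logarithms. The helper below is a hypothetical sketch of that calculation (the name fit_a_b is mine). Note that, depending on the target values, the resulting b may be negative, in which case the targets are not reachable with positive constants.

```python
import math


def fit_a_b(x1, m1, x2, m2, c):
    """Solve e^(-a*x - b) + c = m at the two points (x1, m1) and (x2, m2)."""
    if not (m1 > c and m2 > c and x1 != x2):
        raise ValueError("targets must be above c and at distinct x values")
    y1 = math.log(m1 - c)  # equals -a*x1 - b
    y2 = math.log(m2 - c)  # equals -a*x2 - b
    a = (y1 - y2) / (x2 - x1)
    b = -y1 - a * x1
    return a, b


# Targets taken from the example: 0.7 for 1 transaction, 0.4 for 3 transactions, floor of 0.2
a, b = fit_a_b(1, 0.7, 3, 0.4, c=0.2)
print(round(a, 3), round(b, 3))
```

For these targets, a is approximately 0.458 and b approximately 0.235.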

What is this function called?

In mathematics, this type of function is called an exponential decay function as it is exponentially decreasing when x increases. The idea of using this function for pattern mining was first proposed in my Ph.D thesis:

Fournier-Viger, P. (2010), Un modèle hybride pour le support à l’apprentissage dans les domaines procéduraux et mal-définis. Ph.D. Thesis, University of Quebec in Montreal, Montreal, Canada, 184 pages.

Conclusion

Hope this little trick is helpful!

Next time, I will try to talk about a more general data mining topic. I promised last time that I would do that, but today I wanted to talk about this topic, so I will do it next time instead!

Philippe
Founder of the SPMF data mining software

How to characterize and compare data mining algorithms?

Hi, today, I will discuss how to compare data mining algorithms. This is an important question for data mining researchers who want to evaluate which algorithm is “better” in general or for a given situation. This question is also important for researchers who are writing articles proposing new data mining algorithms and want to convince the reader that their algorithms deserve to be published and used. We can characterize and compare data mining algorithms from several different angles. I will describe them below.

What is the input and output? For some problems there exist several algorithms. However, some of them have slightly different input and this can make the problem much harder to solve. Often, a person who wants to choose a data mining algorithm will look at the most popular algorithms such as ID3, Apriori, etc. But many people will not try to search for less popular algorithms that could better fit their needs. For example, for the problem of association rule mining, the classic algorithm is Apriori. However, there probably exist hundreds of variations that can be more appropriate for a given situation such as discovering associations with weights, fuzzy associations, indirect associations and rare associations.

Which data structures are used? The data structures that are used will often have a great impact on the performance of data mining algorithms in terms of execution time and memory. For example, some algorithms will use less common data structures such as bitsets, KD-trees and heaps.

What is the problem-solving strategy? There are many ways to describe the strategy used by an algorithm for finding a solution, such as: Is it a depth-first search algorithm? A breadth-first search algorithm? A recursive algorithm? An iterative algorithm? Is it a brute-force algorithm? An exhaustive algorithm? A divide-and-conquer algorithm? A greedy algorithm? A linear programming algorithm? etc.

Does the algorithm belong to a well-known class of algorithms? For example, there exist several well-known classes of algorithms for solving problems such as genetic algorithms, quantum algorithms, neural networks and particle swarm optimization-based techniques.

Is the algorithm an approximate algorithm or an exact algorithm? An approximate algorithm is an algorithm that does not always guarantee to return the correct answer. An exact algorithm always returns the correct answer.

Is it a randomized algorithm? Does the algorithm use a random generator? Is it possible that the algorithm will return different results if it is run twice on the same data?

Is it an interactive, incremental, or batch algorithm? A “batch algorithm” is an algorithm that takes some data as input, performs some processing and outputs some data. An incremental algorithm is an algorithm that does not need to recompute the output from scratch if new data arrives. An interactive algorithm is an algorithm where the user can influence the processing of the algorithm while the algorithm is working.
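To illustrate the difference between a batch and an incremental algorithm, here is a small Python sketch on a deliberately simple task: counting in how many transactions each item appears. The function and class names are hypothetical, and real incremental pattern mining algorithms are of course more involved than this.

```python
from collections import Counter


def batch_item_counts(transactions):
    """Batch: recompute all item counts from the full database."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts


class IncrementalItemCounter:
    """Incremental: update the counts when a new transaction arrives."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def add(self, transaction):
        self.counts.update(set(transaction))
        self.n_transactions += 1


# Toy data (illustrative only)
data = [{"a", "b"}, {"a", "c"}, {"b", "c", "d"}]

counter = IncrementalItemCounter()
for t in data:
    counter.add(t)  # no need to re-read the earlier transactions

print(counter.counts == batch_item_counts(data))
```

Both approaches give the same counts, but the incremental version only processes the new transaction each time data arrives.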

How is the performance? Ideally, one should compare the performance of an algorithm with at least one other algorithm for the same problem, if there exists one. To make a good performance comparison, it is important to make a detailed performance analysis. First, we need to determine what will be measured. This depends on what kind of algorithms we are evaluating. In general, it is interesting to measure the maximal memory usage and the execution time of an algorithm. For other algorithms such as classification algorithms, one will also take other measurements such as the accuracy or recall, for example. Another important decision is what kind of data should be used to evaluate the algorithms. To compare the performance of algorithms, more than one dataset should be used. Ideally, one should use datasets having various characteristics. For example, to evaluate frequent itemset mining algorithms, it is important to use sparse and dense datasets, because some algorithms will perform very well on dense datasets but not so well on sparse datasets. It is also preferable to use real datasets instead of synthetic datasets. Another thing that one may want to evaluate is the scalability of the algorithms. This means measuring how algorithms behave in terms of speed/memory/accuracy when the datasets become larger. To assess the scalability of algorithms, some researchers will use a dataset generator that can generate random datasets of various sizes and having various characteristics.
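As a rough illustration of measuring execution time and maximal memory usage, here is a sketch using Python's standard time and tracemalloc modules. The helper name measure and the toy workload are assumptions for the example. Also note that tracemalloc only tracks memory allocated by Python and slows down execution, so for a serious comparison one may prefer to measure time and memory in separate runs.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak


def build_squares(n):
    """Toy workload standing in for a data mining algorithm."""
    return [i * i for i in range(n)]


result, seconds, peak_bytes = measure(build_squares, 100_000)
print(f"{seconds:.4f} s, peak {peak_bytes / 1024:.0f} KiB")
```

To compare two algorithms, one would call measure on each of them with the same datasets, ideally several times, and report the averages.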

What is the complexity? One can make a more or less detailed complexity analysis of an algorithm to assess its general performance. It can help gain insights on the cost of using the algorithm. However, one should also perform experiments on real data to assess the performance. A reason for not just doing a complexity analysis is that the performance of many algorithms will vary depending on the data.

Is the algorithm easy to implement? One important aspect that can make an algorithm more popular is if it is easy to implement or easy to understand. I have often observed that some algorithms are more popular than others simply because they are easier to implement, although there exist more efficient algorithms that are more difficult to implement. Also, if there exist open-source implementations of an algorithm, it can be a reason why that algorithm is preferred.

Does the algorithm terminate in a finite amount of time? For data mining algorithms, this is generally the case. However, some algorithms do not guarantee this.

Is the algorithm a parallel algorithm? Or could it be easily transformed into a parallel algorithm that could be run on a distributed system or parallel system such as a multiprocessor computer or a cluster of computers? If an algorithm offers this possibility, it is an advantage for scalability.

What are the applications of the algorithm? One could also talk about the potential applications of an algorithm. Can it only be applied to a narrow domain? Or does it address a very general problem that can be found in many domains?

This is all for today. If you have some additional thoughts, please share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

A Map of Data Mining Algorithms (offered in SPMF v092c)

Hi,

I have made a map to visualize the relationship between the 52 different data mining algorithms offered in the SPMF data mining software.  You can view it in PNG format by clicking on the picture below:

[Image: map of the data mining algorithms offered in SPMF]

Or you can view it in SVG format :
http://www.philippe-fournier-viger.com/spmf/map_algorithms_spmf_data_mining092.svg

If you have any comments to improve the map of algorithms, please post your comments in the comment section.

Philippe

On the quality of peer-review for academic journals (updated)

Nowadays, there are a lot of low-quality academic journals popping up everywhere on the Web. Actually, it does not take much to start a low-quality journal (just a website). What is the goal of low-quality journals? It is generally to earn money, for example by charging huge publishing fees. Some of these journals have low peer-review standards and will accept almost any paper that is submitted to them. However, it is important for any researcher to avoid these kinds of journals. Publishing in them can look very bad on a CV when it is time to apply for a job.

Why am I writing about this topic? It is because today I received the following e-mail. It is an invitation to review an article for a journal that I had never heard of before. Let’s have a look at it:

Prudence Journal of Business Management
www.prudencejournals.org/pjbm/index.htm

Invitation to Review: PJBM-XXXXXXX

Dear Colleague,

A manuscript titled: XXXXXXXXXXXXXXXXX.

We will be most grateful if you could find time to review the manuscript.
Please find below the abstract:

Abstract

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I am looking forward to hearing from you.

Regards,

Lincoln Otuya
Editorial Assistant
Prudence Journal of Business Management
E-mail: pjbm@prudencejournals.org

So you may ask me what is wrong with this e-mail? The problem is that I’m not even working in business management and I have never published any article on this topic! I’m a computer science researcher working on data mining. I have no qualifications for reviewing papers outside of computer science. I don’t even know how they found my mailing address.

I don’t know this journal. But it does not give me a very good impression.

Update (2013-05-11): I received two comments on this blog post. These comments seem to contradict this post and say that Prudence is a good journal. However, from the blog administration tool, I can see that both comments were posted from the same IP address within 10 minutes of each other and under two different names…

Ariet
XXXXXXXXXXXXXXXX@yahho.com
41.203.67.51
Submitted on 2013/05/01 at 5:06 PM

I receive a mail from prudence journal of medicinal plants research, initially i thought it was a scam (…)i really appreciate the journal.

wilson
XXXXXXXXXXXXX@yahoo.com
41.203.67.51
Submitted on 2013/05/01 at 4:59 PM

to my view, i think is just a mixed up issue, (…) prudence journal of business management actually did a good job in its peer review

Moreover, these IP addresses are from the same country as the website of this journal. So this raises doubts about who wrote these comments….

Philippe

Ariet
ojes4ever@yahho.com
41.203.67.51

I receive a mail from prudence journal of medicinal plants research, initially i thought it was a scam, i later receive the same mail again on call for papers, i now check the website of the journal and discovered that its a functional journal because i saw published articles that are well articulating so i forwarded my manuscripts to them. currently i have gotten feedback from them and soonest my article will be publish. i really appreciate the journal.

wilson
okinedowilson@yahoo.com
41.203.67.51

to my view, i think is just a mixed up issue, from your comment you are not a business researcher but i think you would have reply the journal by simply mailing them that business issue is not your line of interest. many a times i receive invitation from these new journals i just reply them if i am not interested. i receive invitation from the journal too on call for paper and i sent my manuscript to them, to my surprise prudence journal of business management actually did a good job in its peer review

Why should researchers make their research papers available on the internet?

In this blog post, I will discuss the importance of making research papers available on the internet.

As you probably know, many researchers nowadays prefer to search for papers/articles on the internet instead of searching at their university’s library. Searching on the internet is more convenient, although there is a risk of missing some important parts of the literature because several papers are not available online.

Now, given that so many people do their literature review online, it is crucial to make sure that your research is visible on the internet.

If your papers are available on the Web, it is more likely that other researchers will read your papers and cite you.

But don’t forget  that even if your papers are available on some publisher websites, not all  universities pay the subscriptions for accessing these websites. Therefore, it is possible that several researchers may not have access to your papers.

So how should I publish my papers?

First, I would recommend creating a website if you don’t have one already, so that you have a web presence. Then, you could consider posting some of your research papers on your website as PDFs. Before doing that, you should check that it is in accordance with the copyright policy of your publisher. For example, some publishers allow posting a reprint version of papers on an author’s website.

Second, you could register on social networking websites for researchers like ResearchGate, Academia.edu and CiteULike, and upload your research papers (again, according to the copyright policy). For example, the papers posted on ResearchGate can be indexed by Google Scholar.

Third, you could post your papers (or a preprint version of your paper) to archiving services such as arXiv or CiteSeer (again, according to the copyright policy). This can also give good visibility to your papers.

Fourth, another solution is to publish your papers in open-access journals or conferences. However, if you choose this option, you need to be careful to choose a conference or journal that has a good reputation. There are some open-access journals that do not have a good reputation.

Provide some additional content on your website

After putting some of your papers online, you can increase the interest in your research by offering other online content related to your papers on your website. For example, you could add links on your website to download source code, datasets and even PowerPoint presentations related to your research. You could also create a blog and a Twitter account to discuss issues related to your research and link to your PDFs and website. This will help to promote your website and content.

Actually, it is possible to optimize a researcher’s website by using all the tricks that people use in the field of Search Engine Optimization (SEO). This includes choosing a good title for your website, having other websites link to your website, etc.

Editing the metadata of your PDF before publishing on the web

Another important thing to consider when publishing your papers to the web as PDF is to make sure that the metadata of your PDF is correctly set. This is very important!  If your metadata is not set correctly, it is possible that the title of your paper will not appear correctly in search services such as Google Scholar.

How to fix this?

  • If you are using Microsoft Word, go to the properties of your document and change the metadata (title, authors, keywords) before saving your file as PDF.
  • If your file is already a PDF, then use free PDF editing software to edit the metadata of your PDF.

Choosing the name of your PDF carefully

Another important thing to consider before publishing your research papers as PDFs on the web is to choose an appropriate filename for your PDF. For example, if you published a paper about social network mining at the ABC 2012 conference, then don’t name your paper “ABC.pdf” or “ABC2012.pdf”. You should instead choose a name that precisely describes the subject of your paper such as “social_network_mining_for_researchers.pdf”. This will make it more likely that other researchers will find your paper and click on it.
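If you have many papers to rename, a descriptive filename can be derived from the paper title with a few lines of Python. This is just a small sketch; the function name pdf_filename and the choice of underscores as separators are my assumptions.

```python
import re


def pdf_filename(title):
    """Build a descriptive, lowercase, underscore-separated PDF filename from a title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words) + ".pdf"


print(pdf_filename("Social Network Mining for Researchers"))
```

For the title above, this produces “social_network_mining_for_researchers.pdf”.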

If you have some additional thoughts, please share them in the comment section.

What does it take to do a good Ph.D.?

Today, I will discuss this question: “What does it take to do a good Ph.D.?“. We can answer this question from several points of view.

First, on a personal level, some people think that “talent” is the most important thing to have to do a good Ph.D. Actually, in my opinion, it is 10 % talent and 90 % how hard you work. Of course, you need to have good knowledge of computer science if you are doing a Ph.D. in computer science. But ultimately, you need to spend a lot of time reading papers, thinking about ideas, etc. The more time you spend, the better your research will become.

Second, it is important to have a good research team to work with and a good professor to supervise you (see my post on choosing a professor). If you have a good research team, you may be able to collaborate on the projects of other students. This will allow you to get more publications. Moreover, it can allow you to get more feedback on your research project, to learn to work with other researchers, and to learn from them. Working with a famous or semi-famous professor in your field will also help you get more funding for your research (and therefore spend more time on your project) and get other opportunities (reviewing articles for conferences, participating in committees, etc.).

Third, you need to have a good research topic (see my post on choosing a Ph.D. topic). Actually, doing a Ph.D. is about doing research. You will do a good Ph.D. if you can work on a significant problem and bring an original solution that is at least a small but important contribution to your field. So a key step is to choose a good topic. If you choose a topic that is uninteresting, then no matter how hard you work, your research will hardly have any impact.

Fourth, you need to make a good plan. You should set some goals and some deadlines for achieving these goals. In particular, you should plan where you want to submit your papers and when. If you have a good plan, it will help you focus on your goals.

Fifth, you should choose carefully where you publish your papers/articles (see my post: how to choose a conference for publishing research papers?). Often it is more valuable to have a single paper in a top-level conference/journal than to have several papers in bad or average conferences/journals. Therefore, it is sometimes good to resist the urge to publish and to take more time to improve your research and publish in a better conference/journal.

Sixth,  try to get some other opportunities outside of your Ph.D that will help your future career. For example, you could participate in the organization of a workshop with your professor or help him to review some papers.

Seventh,  if you want to work in academia eventually, you may want to teach a course or two during your Ph.D. to get some experience and to see if you like it.

Those are my main pieces of advice on this topic. If you have some additional thoughts, please share them in the comment section.

How to choose a research advisor for M.Sc. / Ph.D ?

In this post I will discuss how to choose a research advisor for an M.Sc. or Ph.D. This is a very important decision for any graduate student, as it can have an important impact on their success and on their career. So what should a graduate student consider when choosing a research advisor?

First, it is important to choose a professor that is working in a field or on a project that you like. The reason is that you will work on your M.Sc. or Ph.D. project for perhaps 2 or more years. So for example, if you are interested in data mining, you should choose a professor that is working in the field of data mining. To know which professors work on which topics, one can look at their websites or talk with them directly. Ideally, one should choose a professor that is famous or semi-famous in his field to get more opportunities.

Second, you may also consider the reputation of the university where the professor is working. But in my opinion, this is not so important. What is much more important than the institution is the research that you will do and the articles that you will publish (how good your research is).

Third, you need to consider what the research advisor will provide to you. Ideally, the professor will provide you with (1) funding ($$$), (2) time (to discuss your research with you) and (3) good opportunities (e.g. participating in a book or in a conference committee). However, most likely you cannot get all three. It is more likely a “pick 2 out of 3” or “pick 1 out of 3”, because professors who provide the best opportunities and have money are likely to be very busy. But don’t worry. There is a strategy to get around this. It is to get a co-advisor. In this case, you could for example get a main advisor with money and opportunities, and a co-advisor who will give you time.

Fourth, I have mentioned the money already. But I will mention it again because it is very important. Some professors have funding and will be able to pay you for your studies. In my opinion, this should be one of the first things that you discuss with a potential advisor. If you can get money, then you will perhaps have less debt when you finish your studies.

Fifth, you could also consider the research environment. Does the potential supervisor have a good research team? Will there be some opportunities to collaborate with other students and publish some joint publications? Those are important questions. Actually, when you do an M.Sc. or Ph.D., I recommend trying to work with other researchers if possible (on your project or on their project). This will help you get more publications.

Sixth, you should have a look at how good the professor is. Is he active in research (I mean, does he publish many articles in good to excellent conferences/journals)? Do other researchers cite him? In my opinion, one should try to get a professor that is as active as possible in research.

Seventh, does the professor have a good social network with other professors in your field of interest? If so, this could create some interesting opportunities for you.

Eighth, you should also consider the location of the university. Will you need to travel to another city? How much will it cost? Will you have to go far away from home? etc.

Those are my main pieces of advice on this topic. Hope that it helps. If you have some additional thoughts, please share them in the comment section.

How to choose a conference for publishing research papers?

Today, I will discuss how to choose conferences for publishing papers.  It is important to make good choices because it can have a huge impact on a researcher’s career.

There are several things to consider for choosing a conference.

First, does the conference have a good reputation? Obviously, it is better to publish papers in conferences that have a very good reputation. If someone publishes in good conferences, it will give more visibility to his work, and thus, it is more likely that other researchers will cite his papers. It will also look better on his CV and give him a better chance of getting hired for jobs or getting grants.

However, the best conferences are sometimes very selective. For example, the top conferences in some fields like data mining can have acceptance rates below 10 %, which means that only 1 paper out of 10 or even fewer may be accepted. Therefore, it sometimes makes sense to submit papers to conferences with a lesser reputation. However, one should avoid at all costs the conferences that have bad reputations. For instance, I know that for hiring professors at some universities, if someone has published in some conferences that I will not name, it will negatively affect the candidate’s chance of getting hired.

Another thing to consider is how difficult it is to get a paper accepted at a given conference. For top-level conferences, one needs to have very good research results to get accepted and also to write the paper very well. So it is important to ask this question: Does my paper have a good chance of getting accepted? To answer this question, one may read papers that were published in the conference proceedings in previous years. It will give an idea of how hard it is to get accepted. One thing that every researcher should know is that the “acceptance rate” that a conference sometimes advertises does not always reflect very well the difficulty of having a paper accepted. For example, some top conferences could have a 10 % acceptance rate, while some others may have a 20 % acceptance rate. But it does not mean that it is twice as hard to get accepted at the former conference. Actually, the one with a 10 % acceptance rate could be much harder if it is a conference with a good reputation, because it will be 10 % of the best papers instead of 20 % of some average papers.

Another important aspect to consider when choosing a conference is the location of the conference. The location is important because it does not cost the same amount of money to travel to every country. Moreover, the registration fee of some conferences is cheaper than that of others. Besides, a researcher should also think about which conference will give him the best opportunity to meet researchers that could be interested in his research, to build collaborations, and to get the best visibility.

The deadline of a conference and the review time are also important. I personally recommend writing down the conference dates and notification dates of several conferences, and then using this to make a plan. Where should I submit? If the paper is not accepted at conference A, then where could I submit my paper after that?

Also one should consider the format.  This is very important because the format of papers and the maximum number of pages can vary widely from one conference to another. Moreover, one should check carefully if the pages are single-column or double-column. This can also make a huge difference on the overall length of the paper.

One should also check who publishes the conference proceedings. Are the proceedings published by a serious publisher or are they printed by the conference organizers at a local store? I recommend only publishing papers in conferences whose proceedings are published by serious publishers and/or indexed in publication databases in your field. This is important because if someone publishes papers in conferences that are not indexed, ten years from now, it is possible that nobody will know that these papers ever existed.

Another aspect to consider is the topic of the conference. Let’s say that someone is working on developing data mining algorithms that are applied to educational data. He could publish his research in several different conferences depending on the topic of the conferences. For example, he could publish his research in an educational data mining conference. He could submit to a data mining conference (educational data mining is a subfield of data mining). Alternatively, he could publish in an artificial intelligence conference (data mining is a subfield of artificial intelligence). Or he could publish in a very general Computer Science conference (artificial intelligence is a subfield of Computer Science). My advice is to not choose a conference that is too general.

Those are my pieces of advice for choosing a conference. Hope that this helps you! If you have some additional thoughts, please share them in the comment section.

How to choose a good thesis topic in Data Mining?

I have seen many people asking for help in data mining forums and on other websites about how to choose a good thesis topic in data mining. Therefore, in this post, I will address this question.

The first thing to consider is whether you want to design/improve data mining techniques, apply data mining techniques or do both.  Personally, I think that designing or improving data mining techniques is more challenging than using already existing techniques.  Moreover, you can make a more fundamental contribution if you work on improving data mining techniques instead of applying them. However, you need to be aware that improving data mining techniques may require better algorithmic and/or mathematics skills.

The second thing to consider is what kind of techniques you want to apply or design/improve. Data mining is a broad field consisting of many techniques such as neural networks, association rule mining algorithms, clustering and outlier detection. You should try to get an overview of the different techniques to see which ones interest you most. To get a rough overview of the field, you could read an introductory book on data mining such as the book by Tan, Steinbach & Kumar (Introduction to data mining), or read websites and articles related to data mining. If your goal is just to apply data mining techniques to achieve some other purpose (e.g. analysing cancer data) but you don’t know which technique yet, you could skip this question.

The third thing to consider is which problems you want to solve or what you want to improve. This requires more thought. A good way is to look at recent good data mining conferences (KDD, ICDM, PKDD, PAKDD, ADMA, DAWAK, etc.) and journals (TKDE, TKDD, KAIS, etc.), or to attend conferences, if possible, and talk with other researchers. This helps to see which topics are currently popular and what kind of problems researchers are currently trying to solve. It does not mean that you need to work on the most popular topic. That said, working on a popular topic (e.g. social network mining) has several advantages: it is easier to get grants and, in some cases, to get your papers accepted in special issues, workshops, etc. However, there are also some “older” topics that remain interesting even if they are not the current flavor of the day. Actually, the most important thing is to find a topic that you like and will enjoy working on for perhaps a few years of your life. Finding a good problem to work on can require reading several articles to understand the limitations of current techniques and to decide what can be improved. So don’t worry. It is normal that it takes time to find a more specific topic.

Fourth, one should not forget that helping to choose a thesis topic is also the job of the professor who supervises the Master’s or Ph.D. student. Therefore, if you are looking for a thesis topic, it is good to talk with your supervisor and ask for suggestions. Your supervisor should help you. If you don’t have a supervisor yet, then try to get a rough idea of what you like, and try to meet and discuss with professors who could become your supervisor. Some of them will perhaps have research projects and ideas that they could give you if you work with them. Choosing a supervisor is a very important and strategic decision that every graduate student has to make. For more information about choosing a supervisor, you can read this post: How to choose a research advisor for M.Sc. / Ph.D ?

Lastly, I would like to discuss the common question “please give me a Ph.D. topic in data mining”, which I read on websites and sometimes receive by e-mail. There are two problems with this question. The first problem is that it is too general. As mentioned, data mining is a very broad field. For example, I could suggest some very specific topics such as detecting outliers in imbalanced stock market data or optimizing the memory efficiency of subgraph mining algorithms for community detection in social networks. But will you like them? It is best to choose by yourself something that you like. The second problem with the above question is that choosing a topic is work that a researcher should do or learn to do. In fact, in research, being able to find a good research problem is as important as being able to find a good solution. Therefore, I highly recommend trying to find a research topic by yourself, as it is important to develop this skill to become a successful researcher. If you are a student, you can ask your research advisor to guide you when searching for a topic.

Also, just for fun, here is a Ph.D thesis title generator.

How to become a good data mining programmer?

In this post, I will discuss what it takes to be a good data mining programmer and how to become one.

Data mining is a broad field that can be approached from several angles. Some people with a mathematical background will employ a statistical approach to data mining and use statistical tools to study data. Others will use ready-made commercial or open-source data mining software to analyze their data. In this post, we will discuss the computer science view of data mining. It is aimed at programmers who would like to become good at implementing and designing data mining algorithms.

There are some great benefits to being not just a user but a data mining programmer. First, you can implement algorithms that are not offered in existing data mining tools. This is important because several data mining tools are restricted to a small set of algorithms. For example, if you consider data mining tasks such as clustering, there are hundreds of algorithms that have been proposed to handle many different scenarios. However, general purpose data mining tools often offer only a few of them. Second, you can download open-source algorithms and adapt them to your needs. Third, you could eventually design your own data mining algorithms and implement them efficiently.

Now that we have talked about the advantages, let’s talk about how to become a good data mining programmer. We can break this down into two aspects: (1) being good at programming and knowledgeable about computer science in general, and (2) being good at programming data mining algorithms.

To be good at programming, you should have a good knowledge of at least one programming language that you will use. Choosing a programming language is important because performance generally matters in data mining. So you may go for a language like C++ that compiles to machine code, or for languages like Java or C# that are reasonably fast and can be more convenient to use. You should avoid web languages such as PHP and JavaScript, which are generally less efficient for this kind of work, unless you have some good reasons to use them.

After that, you should try to get a good knowledge of the data structures that are offered in your programming language. A good programmer should know when to use each data structure. This is important because you will eventually optimize your algorithms. In data mining, optimizations can make the difference between an algorithm that runs for hours and one that runs for just a few minutes, or between using gigabytes and megabytes of memory! So you should get to know the main data structures that are offered, such as array lists, linked lists, binary trees, hash tables, hash sets, bitsets and priority queues (heaps). But more importantly, you should know that there are many data structures that are not offered with your programming language. You should know how to look up other data structures in books or on websites.
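To illustrate how the choice of data structure changes an algorithm, here is a minimal sketch in Python that counts the support of an itemset in two ways: by scanning every transaction, and by intersecting per-item bitsets. The toy database and the function names (support_by_scan, build_bitsets, support_by_bitsets) are made up for this example, and Python integers are used as bitsets only for brevity. It is an illustration of the idea, not a tuned implementation.

```python
# Toy transaction database (illustrative data, not from a real dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]

def support_by_scan(itemset, transactions):
    """Count the transactions containing itemset by scanning the database."""
    return sum(1 for t in transactions if itemset <= t)

def build_bitsets(transactions):
    """Map each item to an int whose bit i is set if transaction i contains it."""
    bitsets = {}
    for i, t in enumerate(transactions):
        for item in t:
            bitsets[item] = bitsets.get(item, 0) | (1 << i)
    return bitsets

def support_by_bitsets(itemset, bitsets, n_transactions):
    """Count the transactions containing itemset by AND-ing the items' bitsets."""
    result = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        result &= bitsets.get(item, 0)  # an unknown item matches no transaction
    return bin(result).count("1")

bitsets = build_bitsets(transactions)
print(support_by_scan({"bread", "butter"}, transactions))                   # 2
print(support_by_bitsets({"bread", "butter"}, bitsets, len(transactions)))  # 2
```

Both functions return the same count. The bitset version builds the index once and then answers each query with a few bitwise operations, instead of rescanning the database for every candidate.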

Besides, you should try to get better at algorithmics (designing efficient algorithms) and at computer science in general. There are many different ways to do that, such as taking courses on these topics or reading books. But most importantly, you need to put the theory into practice and do some programming, which leads me to the key part of this post.

To become good at programming data mining algorithms, you need to write data mining algorithms. To get started, you should read some data mining books such as the book by Tan, Steinbach & Kumar, or the book by Han & Kamber. I recommend starting by implementing some simple algorithms without optimizations. For example, K-means or Apriori are relatively easy to implement. After you have debugged your implementation and checked that it generates the correct result, you should spend time thinking about how to optimize it. First, think about optimizations by yourself. Then look at how other people did it by reading websites and articles, or by looking at the code of other people. Most likely, many optimizations have already been proposed. After that, you could implement the optimizations, and then look at more complex algorithms. Finally, remember that Rome was not built in a day. Give yourself some time to learn!
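As a starting point, a first unoptimized Apriori could look like the sketch below, written in Python. It assumes that transactions are given as sets of items and that minsup is a relative threshold in [0,1], with a pattern being frequent when its support is higher than or equal to minsup. The function name apriori and the toy database are chosen for this illustration only. Support is recomputed by scanning the whole database for every candidate, which is exactly the kind of cost you would later try to optimize.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)
    if n == 0:
        return {}

    def support(itemset):
        # Naive support counting: one full database scan per candidate.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    frequent = {}
    level = []
    for item in sorted({item for t in transactions for item in t}):
        candidate = frozenset([item])
        s = support(candidate)
        if s >= minsup:
            frequent[candidate] = s
            level.append(candidate)

    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = []
        for candidate in candidates:
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if any(frozenset(sub) not in previous
                   for sub in combinations(candidate, k - 1)):
                continue
            s = support(candidate)
            if s >= minsup:
                frequent[candidate] = s
                level.append(candidate)
        k += 1
    return frequent

# Toy database (illustrative data).
db = [{"bread", "milk"}, {"bread", "butter", "milk"}, {"butter"}, {"bread", "butter"}]
for itemset, s in sorted(apriori(db, 0.5).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), s)
```

Once a version like this produces the correct patterns on small examples, you can compare it with published optimizations, for example replacing the repeated database scans with a faster support counting structure.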

I have obviously not mentioned everything. In particular, being good at mathematics is also important. If you have some additional thoughts, you can share them in the comment section.