The Data Blog

How to search for a research advisor by e-mail?

Posted on 2013-06-25 by Philippe Fournier-Viger

In this blog post, I will talk about how to search for a research advisor by e-mail.

I am writing about this because today I received an e-mail from a Ph.D. student from abroad asking to work with me as a post-doc on the topic of “Web Services”. Let’s have a look at the e-mail, and then I will discuss its problems.

From: XXXXXXXX@researchabroad.net
Dear Professor Fournier-Viger,
My name is XXXXXX. I am interested in many areas, including but not limited to “XXXXX”. I am very interested in applying for a postdoctoral position in your lab.
I completed my Ph.D XXXXXXXX majored in XXXX, from XXXXX University, in XXXXXX. Before that, I focused on XXXXXXX both in Master and Bachelor studies.
My research goal is to provide a novel service model XXXXXXXXXXX and so on.
I have many years’ experience in service computing research. And the areas I can pursue is as following,
Mulit-agent research;
Services computing research;
XXXXXXXXX
XXXXXXXXXXXx
XXXXXXXXXX
XXXXXXXXXXXX
I would be grateful if you would give me the opportunity to work in your group. The attached is my CV for your review.
I am eagerly looking forward to hearing from you.
Sincerely yours,
XXXXXXXXXX

When I read this e-mail, I see right away that this message was probably sent to hundreds or thousands of professors. I get this impression because I’m not working on “web services” and because the student writes about HIS research interests instead of explaining why he is interested in working with me. When I receive this kind of e-mail, I usually delete it, and I know that several professors at other universities do the same. On the other hand, if I receive a personalized message from a student that explains why he wants to work with me, I will take the time to read it carefully and answer it.

The advice that I wanted to give in this post is that to be successful when searching for a research advisor by e-mail, it is important to write personalized e-mails to each professor, and to choose professors related to your field. It takes more time. But it will be more successful. This is what I did when looking for a post-doc position when I was a Ph.D. student and it worked very well.

This is what I wanted to write for today. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

Posted in General, Research | Tagged advisor, phd, Research | 3 Comments

What are the steps to implement a data mining algorithm?

Posted on 2013-06-13 by Philippe Fournier-Viger

In this post, I will discuss the steps that I follow to implement a data mining algorithm. The subject of this post comes from a question that I received by e-mail recently, and I think that it is interesting to share the answer. The question was:

When we do he programming I’ve observed a good programmer always makes a flow chart, data flow diagram, context diagram etc. to make the programming error free. (…) In your implementation of XXXX algorithm, have you performed any procedures as discussed above? (…) Can you explain me what are the basic procedures that you have applied to implement XXXX algorithm?

When I implement a data mining algorithm, I don’t do any flowchart or UML design because I consider a single data mining algorithm to be a small project. If I were programming a larger software program, I would think about object-oriented design and UML. But a single algorithm does not need an object-oriented design. Actually, what matters most for a data mining algorithm is that it produces the correct result, is fast and, preferably, that the code is well-documented and clean.

Step 1: understanding the algorithm. To implement a data mining algorithm, the first step that I perform is to read the research paper describing it and make sure that I understand it well. Usually, I need to read the paper a few times to understand it. If there is an example in the paper, I will try to understand it, and later I will use this example to test whether my implementation is correct. After I read the paper a few times, I may still not understand some details about the algorithm. But this is ok. There are often some tricky details that you may only understand when doing the implementation because the authors do not always give all the details in the paper, sometimes due to the lack of space. In particular, it is common that authors do not describe all the optimizations and data structures that they have used.

Step 2: implementing a first draft of the algorithm step by step. Then, I start to implement the algorithm. To do so, I print the pseudocode and I try to implement it. For an algorithm that takes a file as input, I will first work on the code for reading the input file. I will test this code extensively to make sure that I read the input file correctly. Then, I will add the code to perform the first operations of the algorithm. I will check if the intermediary result is correct by comparing it with the example in the paper. If it is not correct, I will debug and maybe read the paper again to make sure that I did not make a mistake because I misunderstood something in the paper. Then I will continue until the algorithm is fully implemented.

Step 3: testing with other input files. When my implementation becomes correct, I will try a few more examples, to make sure that it is not only correct for a single example but that it also provides the correct result for other input files.

Step 4: cleaning the code. After that, I will clean my code because the first draft is likely not pretty.

Step 5: optimizing the code. Then I will try to optimize the algorithm in terms of (1) using better data structures, (2) simplifying the code by removing unnecessary operations, etc. For example, for my implementation of PrefixSpan in my SPMF data mining software, I first made a very simple implementation without an optimization called pseudo-projection that is described in the paper. It was very slow. After my implementation was correct, I took a few days to optimize it. For example, I added the pseudo-projection, I added code for another optimization, which is to remove infrequent items after the first scan of the input file, I removed some unnecessary code that I had left, I reorganized some code, I added some comments, etc.

Step 6: Comparison of the performance with other implementations of the same algorithm / peer review. After my code is optimized, as an optional sixth step, I may compare the performance of my implementation with other implementations of the same algorithm if some are available on the Internet in the same programming language. If my implementation is slower, I may look at the source code of the other implementation to see if there are some ideas that I have not thought of that could be used to further optimize my code. I may also ask some of my friends or colleagues to review my code. Another good way is to not show the code to your colleagues but just to explain the main idea to them to get their feedback. Discussing with other people is a good way to learn.
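To make the infrequent-item optimization mentioned in Step 5 more concrete, here is a minimal sketch in Java. It is not the SPMF code: the class name InfrequentItemFilter, the representation of the database as a list of lists of integers, and the use of an absolute support count are all assumptions made for illustration. The idea is simply to count in how many transactions each item appears during a first scan, and then to drop the items that cannot be part of any frequent pattern.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class InfrequentItemFilter {

    // Returns a copy of the database where every item appearing in fewer than
    // minSupportCount transactions has been removed (hypothetical helper).
    public static List<List<Integer>> removeInfrequentItems(
            List<List<Integer>> database, int minSupportCount) {

        // First scan: count the number of transactions containing each item.
        Map<Integer, Integer> support = new HashMap<>();
        for (List<Integer> transaction : database) {
            // A HashSet ensures an item is counted once per transaction.
            for (Integer item : new HashSet<>(transaction)) {
                support.merge(item, 1, Integer::sum);
            }
        }

        // Second scan: keep only the frequent items, preserving their order.
        List<List<Integer>> filtered = new ArrayList<>();
        for (List<Integer> transaction : database) {
            List<Integer> kept = new ArrayList<>();
            for (Integer item : transaction) {
                if (support.get(item) >= minSupportCount) {
                    kept.add(item);
                }
            }
            // Transactions that become empty are dropped.
            if (!kept.isEmpty()) {
                filtered.add(kept);
            }
        }
        return filtered;
    }
}
```

The later steps of the algorithm then work on a smaller database, which is usually where the speed-up comes from.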

It takes time… Note that becoming good at programming data mining algorithms takes time. For example, let me tell you my story. The first time that I implemented a data mining algorithm was in December 2007. I implemented the Apriori algorithm for a course project at university. My implementation was terrible and slow… But it generated the correct result. I then implemented PrefixSpan in 2008. At that time, my code was better because I was gaining some experience in implementing this kind of algorithm. Then in 2010, I read my Apriori and PrefixSpan code again, I still found some problems, and I optimized them again. What I want to say here is that it is normal that the first implementation(s) of data mining algorithms that a person makes may not be very good. But after implementing a few algorithms, it becomes much easier. Now, we are in 2013 and I have implemented more than 45 algorithms in my open-source SPMF Java data mining software!

This is what I wanted to write for today. Hope that you enjoyed this post. If you want to read more on this topic, you may be interested in my post on How to be a good data mining programmer.

Posted in Data Mining, Programming | Tagged algorithm, data mining, design, implementation, programming | 45 Comments

Choosing data structures according to what you want to do

Posted on 2013-06-12 by Philippe Fournier-Viger

Today, I write a post about programming. I want to share a simple but important idea for writing optimized code. The idea is to choose data structures according to what you want to do instead of what you want to store. This idea is simple. But I write this post because it addresses a common beginner’s misconception which is to think of data structures solely in terms of what they can store.

For example, a beginner programmer may think that s/he should use an array or a list because s/he wants to store some items in a given order, or simply because s/he wants to store a set of single values. To store two-dimensional data, a simple idea is to use a two-dimensional array, etc. This simple reasoning is fine for implementing a program that works.

However, to write an optimized program, it is important to think further about how the data will be used. For example, consider that you need to write a program where you have to store a long list of integer values that is updated periodically (add and remove) and where you want to quickly find the minimum and maximum value. If a programmer thinks about what s/he needs to store, s/he may decide to use an array. If the programmer thinks in terms of what s/he wants to do with the data, s/he may decide to use a list (an array that is dynamically resized) because add and remove operations will be performed periodically. This could be a better solution. However, if the programmer thinks further in terms of what s/he wants to do with the data, s/he may decide to use a red-black tree, which guarantees an O(log(n)) worst-case time cost for the four operations add, remove, minimum and maximum. This could be a much better solution!

It is therefore important to take the time to find appropriate data structures if one wants to write optimized code. Also note that the execution time is important, but the memory usage is sometimes also very important.
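As an illustration of the red-black tree idea, here is a minimal sketch in Java. It relies on java.util.TreeMap, which the Java documentation describes as a red-black tree based map. The class name MinMaxMultiset and its methods are hypothetical names chosen for this example, and the map stores a counter per value because the same integer may be stored several times.

```java
import java.util.TreeMap;

public class MinMaxMultiset {

    // Maps each stored value to its number of occurrences.
    // TreeMap keeps the keys sorted in a red-black tree.
    private final TreeMap<Integer, Integer> counts = new TreeMap<>();

    // Adds one occurrence of the value in O(log n).
    public void add(int value) {
        counts.merge(value, 1, Integer::sum);
    }

    // Removes one occurrence of the value in O(log n).
    // Returns false if the value was not present.
    public boolean remove(int value) {
        Integer count = counts.get(value);
        if (count == null) {
            return false;
        }
        if (count == 1) {
            counts.remove(value);
        } else {
            counts.put(value, count - 1);
        }
        return true;
    }

    // Smallest stored value; throws NoSuchElementException if empty.
    public int min() {
        return counts.firstKey();
    }

    // Largest stored value; throws NoSuchElementException if empty.
    public int max() {
        return counts.lastKey();
    }

    public static void main(String[] args) {
        MinMaxMultiset values = new MinMaxMultiset();
        values.add(5);
        values.add(2);
        values.add(9);
        System.out.println("min=" + values.min() + " max=" + values.max());
    }
}
```

With an unsorted array or list, finding the minimum or maximum would require scanning all the values each time, which is why the tree can be a much better fit for this usage.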

To show you an example of the impact of choosing appropriate data structures on performance, I here compare three versions of TopKRules, an algorithm for mining top-k association rules in a transaction database. TopKRules needs to store a list of candidates and a list of the k best rules, and to perform add, remove, minimum and maximum operations. Furthermore, it needs to be able to quickly perform the intersection of two sets of integers. The next chart shows a performance comparison in terms of execution time of the three versions of TopKRules when the parameter k increases and the problem becomes more difficult, for a dataset called mushrooms.

Version A is TopKRules implemented with lists.
Version B is TopKRules implemented with bitsets to quickly perform the intersection by doing the logical AND operation.
Version C is TopKRules implemented with bitsets plus red-black trees for storing the candidates and the best k rules, to quickly perform add, remove, minimum and maximum.
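To give an idea of what the bitset intersection of Version B can look like, here is a minimal sketch using java.util.BitSet. It is not the TopKRules code: the class name TidSetIntersection and the interpretation of the sets of integers as sets of transaction identifiers are assumptions made for this example.

```java
import java.util.BitSet;

public class TidSetIntersection {

    // Builds a bitset where bit i is set if the integer i belongs to the set.
    public static BitSet toBitSet(int... values) {
        BitSet bits = new BitSet();
        for (int value : values) {
            bits.set(value);
        }
        return bits;
    }

    // Intersection of two sets, computed with a logical AND.
    // The inputs are left unchanged because and() modifies the receiver.
    public static BitSet intersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sets of transaction identifiers containing two items.
        BitSet tidsOfItemA = toBitSet(0, 2, 3, 5);
        BitSet tidsOfItemB = toBitSet(2, 3, 4);
        BitSet common = intersect(tidsOfItemA, tidsOfItemB);
        // cardinality() gives the number of elements in the intersection.
        System.out.println(common + " size=" + common.cardinality());
    }
}
```

The AND operation processes many elements at once because the bits are packed into machine words, whereas intersecting two lists requires comparing the elements one by one.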

As you can see from this chart, there is quite a large improvement in performance by using appropriate data structures!

That’s all I wanted to write for today. Hope that you enjoyed this post.

Posted in Data Mining, Programming | Tagged association rules, bitset, optimization, programming, red-black tree, topkrules | Leave a comment

Analyzing the source code of the SPMF data mining software

Posted on 2013-05-16 by Philippe Fournier-Viger

Hi everyone,

In this blog post, I will discuss how I have applied an open-source tool that is named Code Analyzer ( http://sourceforge.net/projects/codeanalyze-gpl/ ) to analyze the source code of my open-source data mining software named SPMF.

I have applied the tool on the previous version (0.92c) of SPMF, and the tool prints the following result:

Metric                                   Value
-------------------------------       --------
Total Files                                360
Total Lines                              50457
Avg Line Length                             30
Code Lines                               31901
Comment Lines                            13297
Whitespace Lines                          6583
Code/(Comment+Whitespace) Ratio           1,60
Code/Comment Ratio                        2,40
Code/Whitespace Ratio                     4,85
Code/Total Lines Ratio                    0,63
Code Lines Per File                         88
Comment Lines Per File                      36
Whitespace Lines Per File                   18

Now, what is interesting is the difference when I apply the same tool on the latest version of SPMF (0.93). It gives the following result:

Metric                                   Value
-------------------------------       --------
Total Files                                280
Total Lines                              53165
Avg Line Length                             32
Code Lines                               25455
Comment Lines                            23208
Whitespace Lines                          5803
Code/(Comment+Whitespace) Ratio           0,88
Code/Comment Ratio                        1,10
Code/Whitespace Ratio                     4,39
Code/Total Lines Ratio                    0,48
Code Lines Per File                         90
Comment Lines Per File                      82
Whitespace Lines Per File                   20

As you can see from these statistics, I have done a lot of refactoring for the latest version. There are now 280 files instead of 360 files. Moreover, I have shrunk the code from 31901 lines to 25455 lines, without removing any functionality!

Also, I have added a lot of comments to SPMF. The “Code/Comment” ratio has thus changed from 2.40 to 1.10, and the “Comment Lines Per File” went up from 36 to 82 lines. In total, there are now around 10,000 more lines of comments than in the previous version (the number of lines of comments has increased from 13297 to 23208).

That’s all I wanted to write for today!

Posted in Data Mining, Java, open-source, Pattern Mining, Programming, spmf | Tagged data mining, java, open-source, software, spmf | 2 Comments

How to auto-adjust the minimum support threshold according to the data size

Posted on 2013-05-11 by Philippe Fournier-Viger

Today, I will do a quick post on how to automatically adjust the minimum support threshold of frequent pattern mining algorithms such as Apriori, FPGrowth and PrefixSpan according to the size of the data.

The problem is simple. Let’s consider the Apriori algorithm. The input of this algorithm is a transaction database and a parameter called minsup that is a value in [0,1]. The output of Apriori is a set of frequent patterns. A frequent pattern is a pattern such that its support is greater than or equal to minsup. The support of a pattern (also called “frequency”) is the number of transactions that contain the pattern divided by the total number of transactions in the database. A key problem for algorithms like Apriori is how to choose a minsup value to find interesting patterns. There is no really easy way to determine the best minsup threshold. Usually, it is done by trial and error. Now let’s say that you have determined the best minsup value for your application.
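To make the definition of the support concrete, here is a minimal sketch in Java. The class name SupportExample and the representation of a transaction as a set of integers are assumptions made for this example. This is only the definition written as code, not how Apriori counts supports internally.

```java
import java.util.List;
import java.util.Set;

public class SupportExample {

    // Support of a pattern: the number of transactions containing all the
    // items of the pattern, divided by the total number of transactions.
    public static double support(List<Set<Integer>> database, Set<Integer> pattern) {
        if (database.isEmpty()) {
            return 0.0;
        }
        int count = 0;
        for (Set<Integer> transaction : database) {
            if (transaction.containsAll(pattern)) {
                count++;
            }
        }
        return (double) count / database.size();
    }

    // A pattern is frequent if its support is greater than or equal to minsup.
    public static boolean isFrequent(List<Set<Integer>> database,
                                     Set<Integer> pattern, double minsup) {
        return support(database, pattern) >= minsup;
    }
}
```

For example, in a database of four transactions {1,2,3}, {1,2}, {2,3} and {1,3}, the pattern {1,2} appears in two transactions, so its support is 2/4 = 0.5, and it is frequent for any minsup up to 0.5.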

Now, consider that the size of your data can change over time. In this case, how can you dynamically adjust the minimum support so that the threshold is higher when you don’t have a lot of data and becomes lower when you have more data?

A simple solution that I have found is to use a mathematical function for adjusting the minsup threshold automatically according to the database size (the number of transactions for Apriori). This solution is shown below.

I chose the formula minsup(x) = e^(-ax-b) + c, where x is the number of transactions in the database and a, b and c are constants, because it sets minsup to a high value when there is not a lot of data and then decreases minsup as more data becomes available. For example, on the first chart below, the minsup value is set to 0.7 if there is 1 transaction, it becomes 0.4 when there are 3 transactions and then decreases to 0.2 when there are around 9 transactions. This allows the minsup threshold to become stricter when there is less data. Note that the constants a, b and c can be adjusted to make the curve behave differently.

On the second chart above, I show the relative minsup threshold. It is simply the minsup threshold multiplied by the database size. It shows the number of transactions in which a pattern needs to appear to be frequent, according to the database size.
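Here is a minimal sketch of the formula minsup(x) = e^(-ax-b) + c in Java. The constants used in the charts are not given in this post, so the values a = 0.46, b = 0.23 and c = 0.2 below are only illustrative values that I picked because they roughly reproduce the numbers mentioned above (about 0.7 for 1 transaction, 0.4 for 3 transactions and close to 0.2 for 9 transactions). The class name AdaptiveMinsup and the rounding up in minSupportCount are also assumptions made for this example.

```java
public class AdaptiveMinsup {

    // Illustrative constants only (not the values used for the charts).
    static final double A = 0.46;
    static final double B = 0.23;
    static final double C = 0.2;

    // minsup(x) = e^(-ax-b) + c, where x is the number of transactions.
    public static double minsup(int x) {
        return Math.exp(-A * x - B) + C;
    }

    // Number of transactions in which a pattern must appear to be frequent:
    // the minsup threshold multiplied by the database size. Rounding up to
    // an integer is an assumption of this sketch.
    public static int minSupportCount(int x) {
        return (int) Math.ceil(minsup(x) * x);
    }

    public static void main(String[] args) {
        for (int x : new int[] {1, 3, 9, 50}) {
            System.out.printf("x=%d minsup=%.3f count=%d%n",
                    x, minsup(x), minSupportCount(x));
        }
    }
}
```

With this shape of function, c acts as the threshold that is approached when the database becomes large, while a controls how quickly the threshold drops toward it.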

Hope this little trick is helpful!

Next time, I will try to talk about a more general data mining topic. I promised last time that I would do that. But today I wanted to talk about this topic, so I will do that next time instead!

Philippe
Founder of the SPMF data mining software

Posted in Data Mining, Programming | Tagged apriori, fpgrowth, frequent pattern mining, itemset mining, minsup, prefixspan | 48 Comments

How to characterize and compare data mining algorithms?

Posted on 2013-04-15 by Philippe Fournier-Viger

Hi, today, I will discuss how to compare data mining algorithms. This is an important question for data mining researchers who want to evaluate which algorithm is “better” in general or for a given situation. This question is also important for researchers who are writing articles proposing new data mining algorithms and want to convince the reader that their algorithms deserve to be published and used. We can characterize and compare data mining algorithms from several different angles. I will describe them below.

What is the input and output? For some problems, there exist several algorithms. However, some of them have slightly different input, and this can make the problem much harder to solve. Often, a person who wants to choose a data mining algorithm will look at the most popular algorithms such as ID3, Apriori, etc. But many people will not try to search for less popular algorithms that could better fit their needs. For example, for the problem of association rule mining, the classic algorithm is Apriori. However, there probably exist hundreds of variations that can be more appropriate for a given situation, such as discovering associations with weights, fuzzy associations, indirect associations and rare associations.

Which data structures are used? The data structures that are used will often have a great impact on the performance of data mining algorithms in terms of execution time and memory. For example, some algorithms will use less common data structures such as bitsets, KD-trees and heaps.

What is the problem-solving strategy? There are many ways to describe the strategy used by an algorithm for finding a solution, such as: Is it a depth-first search algorithm? A breadth-first search algorithm? A recursive algorithm? An iterative algorithm? A brute-force algorithm? An exhaustive algorithm? A divide-and-conquer algorithm? A greedy algorithm? A linear programming algorithm? etc.

Does the algorithm belong to a well-known class of algorithms? For example, there exist several well-known classes of algorithms for solving problems, such as genetic algorithms, quantum algorithms, neural networks and particle swarm optimization-based techniques.

Is the algorithm an approximate algorithm or an exact algorithm? An approximate algorithm is an algorithm that does not always guarantee to return the correct answer. An exact algorithm always returns the correct answer.

Is it a randomized algorithm? Does the algorithm use a random generator? Is it possible that the algorithm will return different results if it is run twice on the same data?

Is it an interactive, incremental, or a batch algorithm? A “batch algorithm” is an algorithm that takes some data as input, performs some processing and outputs some data. An incremental algorithm is an algorithm that does not need to recompute the output from scratch if new data arrives. An interactive algorithm is an algorithm where the user can influence the processing of the algorithm while the algorithm is working.

How is the performance? Ideally, one should compare the performance of an algorithm with at least one other algorithm for the same problem, if one exists. To make a good performance comparison, it is important to make a detailed performance analysis. First, we need to determine what will be measured. This depends on what kind of algorithms we are evaluating. In general, it is interesting to measure the maximal memory usage and the execution time of an algorithm. For other algorithms such as classification algorithms, one will also take other measurements such as the accuracy or recall, for example. Another important decision is what kind of data should be used to evaluate the algorithms. To compare the performance of algorithms, more than one dataset should be used. Ideally, one should use datasets having various characteristics. For example, to evaluate frequent itemset mining algorithms, it is important to use sparse and dense datasets, because some algorithms will perform very well on dense datasets but not so well on sparse datasets. It is also preferable to use real datasets instead of synthetic datasets. Another thing that one may want to evaluate is the scalability of the algorithms. This means measuring how algorithms behave in terms of speed/memory/accuracy when the datasets become larger. To assess the scalability of algorithms, some researchers will use a dataset generator that can generate random datasets of various sizes and having various characteristics.
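As a rough illustration of how the execution time and memory can be measured in Java, here is a minimal sketch. The class name SimpleBenchmark and the method measure are hypothetical. Note that this sketch only compares the memory used by the JVM before and after the run, which is not the same thing as the maximal memory usage during the run; measuring the peak would require sampling the memory while the algorithm is working. Also, System.gc() is only a hint to the JVM, so the memory figure should be treated as approximate.

```java
public class SimpleBenchmark {

    // Memory currently used by the JVM, in bytes.
    public static long usedMemoryBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Runs the algorithm once and returns {elapsed nanoseconds,
    // approximate growth in used memory in bytes}.
    public static long[] measure(Runnable algorithm) {
        System.gc(); // hint only; the JVM may ignore it
        long memoryBefore = usedMemoryBytes();
        long start = System.nanoTime();

        algorithm.run();

        long elapsedNanos = System.nanoTime() - start;
        long memoryAfter = usedMemoryBytes();
        // The difference can be negative if a garbage collection occurred.
        long memoryGrowth = Math.max(0L, memoryAfter - memoryBefore);
        return new long[] {elapsedNanos, memoryGrowth};
    }

    public static void main(String[] args) {
        long[] result = measure(() -> {
            // Placeholder workload standing in for a data mining algorithm.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.println("time (ms): " + result[0] / 1_000_000.0);
        System.out.println("memory growth (bytes): " + result[1]);
    }
}
```

For a fair comparison, it is generally a good idea to run each algorithm several times on each dataset and to report an average, because a single run can be affected by the JVM warm-up and by other programs running on the computer.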

What is the complexity? One can make a more or less detailed complexity analysis of an algorithm to assess its general performance. It can help gain insights on the cost of using the algorithm. However, one should also perform experiments on real data to assess the performance. A reason for not just doing a complexity analysis is that the performance of many algorithms will vary depending on the data.

Is the algorithm easy to implement? One important aspect that can make an algorithm more popular is if it is easy to implement or easy to understand. I have often observed that some algorithms are more popular than others simply because they are easier to implement, although there exist more efficient algorithms that are more difficult to implement. Also, if there exist open-source implementations of an algorithm, it can also be a reason why the algorithm is preferred.

Does the algorithm terminate in a finite amount of time? For data mining algorithms, this is generally the case. However, some algorithms do not guarantee this.

Is the algorithm a parallel algorithm? Or could it be easily transformed into a parallel algorithm that could be run on a distributed system or parallel system such as a multiprocessor computer or a cluster of computers? If an algorithm offers this possibility, it is an advantage for scalability.

What are the applications of the algorithm? One could also talk about the potential applications of an algorithm. Can it only be applied to a narrow domain? Or does it address a very general problem that can be found in many domains?

This is all for today. If you have some additional thoughts, please share them in the comment section.

Posted in Data Mining, Programming, Research | Tagged algorithms, characteristics, classification, comparison, data mining, evaluation | 9 Comments

A Map of Data Mining Algorithms (offered in SPMF v092c)

Posted on 2013-04-09 by Philippe Fournier-Viger

Hi,

I have made a map to visualize the relationship between the 52 different data mining algorithms offered in the SPMF data mining software. You can view it in PNG format by clicking on the picture below:

Or you can view it in SVG format :
https://www.philippe-fournier-viger.com/spmf/map_algorithms_spmf_data_mining092.svg

If you have any comments to improve the map of algorithms, please post your comments in the comment section.

Philippe

Posted in Data Mining, Java, open-source, Pattern Mining, Programming, spmf | Tagged algorithms, data mining, java, map, open-source, spmf | 2 Comments

On the quality of peer-review for academic journals (updated)

Posted on 2013-04-09 by Philippe Fournier-Viger

Nowadays, there are a lot of low-quality academic journals popping up everywhere on the Web. Actually, it does not take much to start a low-quality journal (just a website). What is the goal of low-quality journals? It is generally to earn money, for example by charging huge publishing fees. Some of these journals have low peer-review standards and will accept almost any paper that is submitted to them. However, it is important for any researcher to avoid this kind of journal. It can look very bad on a CV when it is time to apply for a job.

Why am I talking about this topic? It is because today I have received the following e-mail. It is an invitation to review an article for a journal that I have never heard of before. Let’s have a look at it:

Prudence Journal of Business Management
www.prudencejournals.org/pjbm/index.htm

Invitation to Review: PJBM-XXXXXXX

Dear Colleague,

A manuscript titled: XXXXXXXXXXXXXXXXX.

We will be most grateful if you could find time to review the manuscript.
Please find below the abstract:

Abstract

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I am looking forward to hearing from you.

Regards,

Lincoln Otuya
Editorial Assistant
Prudence Journal of Business Management
E-mail: pjbm@prudencejournals.org

So you may ask me what is wrong with this e-mail? The problem is that I’m not even working in business management and I have never published any article on this topic! I’m a computer science researcher working on data mining. I have no qualifications for reviewing papers outside of computer science. I don’t even know how they found my e-mail address.

I don’t know this journal. But it does not give me a very good impression.

Update (2013-05-11): I received two comments on this blog post. These comments seem to contradict this post and say that Prudence is a good journal. However, from the blog administration tool, I can see that both comments were posted from the same IP address within 10 minutes of each other and with two different names…

Ariet
XXXXXXXXXXXXXXXX@yahho.com
41.203.67.51
Submitted on 2013/05/01 at 5:06 PM

I receive a mail from prudence journal of medicinal plants research, initially i thought it was a scam (…)i really appreciate the journal.

wilson
XXXXXXXXXXXXX@yahoo.com
41.203.67.51
Submitted on 2013/05/01 at 4:59 PM

to my view, i think is just a mixed up issue, (…) prudence journal of business management actually did a good job in its peer review

Moreover, this IP address is from the same country as the website of this journal. So this raises doubts about who wrote these comments….

Philippe


Posted in Research | Tagged academic journal, journal, peer-review | Leave a comment

Why researchers should make their research papers available on internet?

Posted on 2013-03-30 by Philippe Fournier-Viger

In this blog post, I will discuss the importance of making research papers available on the internet.

As you probably know, many researchers nowadays prefer to search for papers/articles on the internet instead of searching at their university’s library. Searching on the internet is more convenient, although there is a risk of missing some important parts of the literature because several papers are not available online.

Now, given that so many people do their literature review online, it is crucial to make sure that your research is visible on the internet.

If your papers are available on the Web, it is more likely that other researchers will read your papers and cite you.

But don’t forget that even if your papers are available on some publisher websites, not all universities pay the subscriptions for accessing these websites. Therefore, it is possible that several researchers may not have access to your papers.

So how to publish my papers?

First, I would recommend creating a website if you don’t have one already, so that you have a web presence. Then, you could consider posting some of your research papers on your website as PDFs. Before doing that, you should check that it is in accordance with the copyright policy of your publisher. For example, some publishers allow posting a reprint version of a paper on the author’s website.

Second, you could register on social networking websites for researchers like ResearchGate, Academia.edu and CiteULike, and upload your research papers (again, according to the copyright policy). For example, the papers posted on ResearchGate can be indexed by Google Scholar.

Third, you could post your papers (or a preprint version of your papers) to archiving services such as arXiv or CiteSeer (again, according to the copyright policy). This can also provide good visibility for your papers.

Fourth, another solution is to publish your papers in open-access journals or conferences. However, if you choose this option, you need to be careful to choose a conference or journal that has a good reputation. There are some open-access journals that do not have a good reputation.

Provide some additional content on your website

After putting some of your papers online, you can increase interest in your research by offering other online content related to your papers on your website. For example, you could add links on your website to download source code, datasets and even PowerPoint presentations related to your research. You could also create a blog and a Twitter account to discuss issues related to your research and link to your PDFs and website. This will help to promote your website and content.

Actually, it is possible to optimize a researcher’s website by using all the tricks that people use in the field of Search Engine Optimization (SEO). This includes choosing a good title for your website, having other websites link to your website, etc.

Editing the metadata of your PDF before publishing on the web

Another important thing to consider when publishing your papers on the web as PDFs is to make sure that the metadata of each PDF is correctly set. This is very important! If your metadata is not set correctly, the title of your paper may not appear correctly in search services such as Google Scholar.

How to fix this?

If you are using Microsoft Word, go to the properties of your document and change the metadata (title, authors, keywords) before saving your file as PDF.
If your file is already a PDF, then use free PDF editing software to edit the metadata of your PDF.
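
If you have many PDFs to fix, you could also script this step. The following is only a minimal sketch in Python, assuming that the third-party pypdf library is installed. The function names are hypothetical and chosen for illustration; the metadata keys (/Title, /Author, /Keywords) are the standard PDF document information fields.

```python
def build_pdf_metadata(title, authors, keywords):
    """Build a PDF document-information dictionary (hypothetical helper)."""
    return {
        "/Title": title,
        "/Author": ", ".join(authors),
        "/Keywords": ", ".join(keywords),
    }


def write_pdf_with_metadata(src_path, dst_path, metadata):
    """Copy the pages of src_path into dst_path and set the given metadata.

    Assumption: the pypdf package is installed. It is imported here so that
    build_pdf_metadata() can still be used without it.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)      # copy every page unchanged
    writer.add_metadata(metadata)  # set title, authors and keywords
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)
```

For example, you could build the metadata with build_pdf_metadata("Social Network Mining for Researchers", ["A. Author"], ["social networks", "data mining"]) and then pass it to write_pdf_with_metadata() together with the paths of the input and output files. The title, author and paths here are placeholders.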

Choosing the name of your PDF carefully

Another important thing to consider before publishing your research papers on the web as PDFs is to choose an appropriate filename for each PDF. For example, if you published a paper about social network mining at the ABC 2012 conference, then don’t name your paper “ABC.pdf” or “ABC2012.pdf”. You should instead choose a name that precisely describes the subject of your paper, such as “social_network_mining_for_researchers.pdf”. This will make it more likely that other researchers will find your paper and click on it.
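
If you want to name your files consistently, a descriptive filename can be derived from the title of the paper. Here is a small Python sketch of one possible way to do it. The function name paper_filename and the limit of eight words are my own assumptions, not a standard.

```python
import re


def paper_filename(title, max_words=8):
    """Turn a paper title into a descriptive PDF filename (hypothetical helper)."""
    # Lowercase the title and keep only runs of letters and digits
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("the title must contain at least one letter or digit")
    # Join the first max_words words with underscores
    return "_".join(words[:max_words]) + ".pdf"


print(paper_filename("Social Network Mining for Researchers"))
# prints: social_network_mining_for_researchers.pdf
```

Limiting the number of words keeps the filename short when the title is long, while still describing the subject of the paper.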

If you have some additional thoughts, please share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


What does it take to do a good Ph.D.?

Posted on 2013-03-27 by Philippe Fournier-Viger

Today, I will discuss this question: “What does it take to do a good Ph.D.?” We can answer this question from several points of view.

First, on a personal level, some people think that “talent” is the most important thing to have to do a good Ph.D. Actually, in my opinion, it is 10% talent and 90% how hard you work. Of course, you need to have good knowledge of computer science if you are doing a Ph.D. in computer science. But ultimately, you need to spend a lot of time reading papers, thinking about ideas, etc. The more time you spend, the better your research will get.

Second, it is important to have a good research team to work with and a good professor to supervise you (see my post on choosing a professor). If you have a good research team, you may be able to collaborate on the projects of other students. This will allow you to get more publications. Moreover, it can allow you to get more feedback on your research project, to learn to work with other researchers, and to learn from them. Working with a famous or semi-famous professor in your field will also help you get more funding for your research (and therefore spend more time on your project) and get other opportunities (reviewing articles for conferences, participating in committees, etc.).

Third, you need to have a good research topic (see my post on choosing a Ph.D. topic). Actually, doing a Ph.D. is about doing research. You will do a good Ph.D. if you can work on a significant problem and bring an original solution that is at least a small but important contribution to your field. So a key step is to choose a good topic. If you choose a topic that is uninteresting, then no matter how hard you work, your research will hardly have any impact.

Fourth, you need to plan well. You should set some goals and some deadlines for achieving these goals. In particular, you should make a plan of where you want to submit your papers and when. If you have a good plan, it will help you focus on your goals.

Fifth, you should choose carefully where you publish your papers and articles (see my post: how to choose a conference for publishing research papers?). Often it is more valuable to have a single paper in a top-level conference or journal than to have several papers in bad or average conferences or journals. Therefore, it is sometimes good to resist the urge to publish and to take more time to improve your research and publish in a better conference or journal.

Sixth, try to get some other opportunities outside of your Ph.D. that will help your future career. For example, you could participate in the organization of a workshop with your professor or help your professor review some papers.

Seventh, if you want to work in academia eventually, you may want to teach a course or two during your Ph.D. to get some experience and to see if you like it.

That is my main advice on this topic. If you have some additional thoughts, please share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

