Brief report about the IEA AIE 2016 conference

This week, I attended the IEA AIE 2016 conference, held in Morioka, Japan from the 2nd to the 4th of August 2016. In this blog post, I will briefly discuss the conference.

About the conference

IEA AIE 2016 (29th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems) is an artificial intelligence conference with a focus on applications of artificial intelligence, although it also accepts theoretical papers as well as data science papers. This year, 85 papers were accepted. The proceedings were provided on a USB drive.

There was also the option to buy the printed proceedings, though:

IEA AIE 2016 proceedings

First day of the conference

Two keynote speeches were given on the first day of the conference. During the second keynote, I saw about 50 people attending. Many attendees came from Asia, which is understandable since the conference was held in Japan.
The paper presentations on the first day covered topics such as Knowledge-based Systems, the Semantic Web, Social Networks (clustering, relationship prediction), Neural Networks, Evolutionary Algorithms and Heuristic Search, Computer Vision, and Adaptive Control.

Second day

On the second day of the conference, there was a great keynote talk by Prof. Jie Lu from Australia about recommender systems. She first introduced the main recommendation approaches (content-based filtering, collaborative filtering, knowledge-based recommendation, and hybrid approaches) and some of the typical problems that recommender systems face (the cold-start problem, the sparsity problem, etc.). She then talked about her recent work on extensions of the recommendation problem, such as group recommendation (e.g. recommending a restaurant that will globally satisfy, or at least not disappoint, a group of persons), trust-based recommendation (e.g. a system that recommends products to you based on what friends that you trust, or friends of your friends, have liked), fuzzy recommender systems (recommender systems that consider that each item can belong to more than one category), and cross-domain recommendation (e.g. if you like reading books about Kung Fu, you may also like watching movies about Kung Fu).

After that, there were several paper presentations. The main topics were the Semantic Web, Social Networks, Data Science, Neural Networks, Evolutionary Algorithms, Heuristic Search, Soft Computing and Multi-Agent Systems.

In the evening, there was a great banquet at the Morioka Grand Hotel, a nice hotel located on a mountain, on the outskirts of the city.

Moreover, during the dinner, there was a live music band.

Third day

On the third day, there was an interesting keynote on robotics by Prof. Hiroshi Okuno from Waseda University / the University of Tokyo. His team has proposed an open-source software called Hark for robot audition. Robot audition is the general process by which a robot processes sounds from its environment. The software, the result of many years of research, has been used in robots equipped with arrays of microphones. By using the Hark library, robots can listen to multiple persons talking to them at the same time, localize where the sounds come from, and isolate sounds, among many other capabilities.

It was followed by paper presentations on topics such as Data Science (KNN, SVM, itemset mining, clustering), Decision Support Systems, Medical Diagnosis and Bio-informatics, and Natural Language Processing and Sentiment Analysis. I presented a paper about a new high-utility itemset mining algorithm called FHM+.

Location

The conference was held in Morioka, a relatively small city in Japan. However, the timing of the conference was perfect: it took place during the Sansa Odori festival, one of the most famous festivals in Japan. Thus, during the evenings, it was possible to watch the Sansa parade, where people wearing traditional costumes were playing taiko drums and dancing in the streets.

Conclusion

The conference was quite interesting. Since it is a fairly general conference, I did not talk with many people working close to my research area, but I met some interesting people, including some top researchers. During the conference, it was announced that IEA AIE 2017 will be held in Arras, France (close to Paris).

==

Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

 

Posted in artificial intelligence, Conference, Data Mining, Data science, Research | Leave a comment

How to give a good paper presentation at an academic conference?

In this blog post, I will discuss how to give a good paper presentation at an academic conference. If you are a researcher, this is an important topic, because giving a good presentation of your work will raise interest in it. In fact, a good researcher should not only be good at doing research, but also at communicating the results of that research, both in writing and orally.

Rule 1: Prepare yourself, and know the requirements

Giving a good oral presentation starts with good preparation. You should not prepare your presentation the day before, but a few days in advance, to make sure there is enough time to prepare well. A common mistake is to prepare a presentation the evening before: the presenter may finish late, not sleep well, be tired, and as a result give a poor presentation.

Preparing a presentation does not only mean designing some PowerPoint slides. It also means practicing your presentation. If you are going to give a paper presentation at a conference, you should ideally practice giving it several times, alone in your bedroom or in front of friends, before giving it in front of your audience. You will then be more prepared, feel less nervous, and give a better presentation.

It is also important to understand the context of your presentation: (1) who will attend the presentation? (2) how long should it be? (3) what kind of equipment will be available (projector, computer, room, etc.)? (4) what is the background of the people attending? These general questions need to be answered to help you prepare an oral presentation.

Who will attend the presentation matters. A presentation given to experts in your field should be different from one given to a friend, your research advisor, or some kids. A presentation should always be adapted to the audience.

To avoid unpleasant surprises, it is always better to check the equipment that will be available for your presentation and to prepare a backup plan in case problems occur. For example, you may bring your laptop and a copy of your presentation on a USB drive, as well as a copy in your e-mail inbox, just in case.

It is also important to know the expected length of the presentation. If a conference presentation should last no more than 20 minutes, for example, then you should make sure that it does not. At an academic conference, it is quite possible that someone will stop your presentation if you exceed the time limit. Moreover, exceeding the time limit may be seen as disrespectful.

Rule 2: Always look at your audience

When giving a presentation, there are a few important rules that should always be followed. One of the most important is to always look at your audience when talking.

You should NEVER read the slides or turn your back to the audience for more than a few seconds. I have seen presenters at academic conferences who did not look at the audience for long periods of time, and it is one of the best ways to annoy the audience. For example, here are some pictures that I took at an academic conference.

Not looking at the audience

Turning your back to the audience

In that presentation, the presenter barely looked at the audience: either he was looking at the floor while talking (first picture), or he was reading the slides (second picture). This is one of the worst things to do, and the presentation was in fact awful, not because the research work was not good, but because it was poorly presented. To give a good presentation, one should try to look at the audience as much as possible. It is OK to sometimes look at a slide to point something out, but for no more than a few seconds. Otherwise, the audience may lose interest in your presentation.

Personally, when I give a presentation, I glance quickly at the computer screen to catch some keywords and remember what I should say, and then I look at the audience while talking. When I move to the next slide, I briefly look at my screen again to remember what to say on that slide, and then keep looking at the audience. Doing this results in a much better presentation, but it may require some preparation. If you practice giving your talk several times beforehand, in your bedroom for example, you will become more natural and will not need to read your slides when the time comes to present in front of an audience.

Rule 3:  Talk loud enough

Another important thing is to talk LOUD enough and to speak clearly when giving a presentation. Make sure that even the people in the back of the room can hear you clearly. This may seem obvious, but at academic conferences there are often presenters who do not speak loud enough, and it becomes very boring for the audience, especially for those in the back of the room.

Rule 4:  Do not sit 

Another good piece of advice is to stand when giving a presentation. I have seen some people give a presentation while seated. In general, if you are seated, you will appear less dynamic. It is always better to stand up to give a presentation.

Rule 5:  Make simple slides

A very common problem that I have observed in presentations at academic conferences is that presenters put way too much content on their slides. Here are some pictures that I took at an academic conference, as examples:

A slide with too many formulas

In this picture, the problem is that there are too many technical details and formulas. It is impossible for someone attending a 20-minute presentation, with slides full of formulas, to read, understand and remember all these formulas and symbols.

In general, when I give a presentation at a conference, I do not show all the details, formulas, or theorems. Instead, I give only the minimum details needed for the audience to understand the basic idea of my work: what problem I want to solve, and the main intuition behind the solution. I also try to explain some applications of my work, and to show illustrations or simple examples to make it easy to understand. Ultimately, the goal of a paper presentation is for the audience to understand the main idea of your work. Then, if someone from the audience wants to know all the technical details, they can read your paper.

If someone does as in the picture above, giving way too many technical details or showing a lot of formulas during a presentation, the audience will very quickly get lost in the details and stop following the presentation.

Here is another example:

A slide with way too much content

In the above slide, there is way too much text. Nobody in the audience will read all this text. To make a good presentation, you should try to make your slides as simple as possible. You should also not put full sentences on slides, but rather keywords or short sentence fragments. The reason is that during the presentation, you should not read the slides, and the audience should not be reading long text on your slides either: you should talk, and the audience should listen to you rather than read your slides. Here is an example of a good slide design:

A PowerPoint slide with just enough content

This slide has just enough content: some very short text that gives only the main points. The presenter can then explain these points in more detail while looking at the audience rather than reading the slides.

Conclusion

A lot more could be said about giving a good presentation, but I did not want to write too much for today. Ultimately, giving good presentations is something that is learned through practice. The more you practice giving presentations, the more comfortable you will become talking in front of many people.

Also, it is quite normal for a student to be nervous when giving a presentation, especially in a foreign language. In that case, more preparation is required.

I hope that you have enjoyed this short blog post.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Conference, General, Research | Leave a comment

Brief report about the 12th International Conference on Machine Learning and Data Mining conference (MLDM 2016)

In this blog post, I will provide a brief report about the 12th International Conference on Machine Learning and Data Mining (MLDM 2016), which I attended from the 18th to the 20th of July 2016 in New York, USA.

About the conference
This is the 12th edition of the conference. The MLDM conference is co-located and co-organized with the 16th Industrial Conference on Data Mining 2016, which I also attended this week. The proceedings of MLDM are published by Springer. Moreover, an extra book containing two late papers was offered, published by Ibai solutions.

The MLDM 2016 proceedings

Acceptance rate

The acceptance rate of the conference is about 33% (58 papers were accepted out of 169 submissions), which is reasonable.

First day of the conference

The first day of the MLDM conference started at 9:00 with an opening ceremony, followed by a keynote on supervised clustering. The idea of supervised clustering is to perform clustering on data that already has some class labels. It can thus be used, for example, to discover sub-classes within existing classes. The class labels can also be used to evaluate how good the clusters are. One of the cluster evaluation measures suggested by the keynote speaker is purity, that is, the percentage of instances in a cluster that have the cluster's most frequent class label. The purity measure can be used, among other applications, to remove outliers from clusters.

After the keynote, there were paper presentations for the rest of the day. Topics were quite varied, including clustering, support vector machines, stock market prediction, list price optimization, image processing, automatic authorship attribution of texts, driving style identification, and source code mining.

The conference room

The conference ended at around 17:00 and was followed by a banquet at 18:00. There were about 40 people attending the conference in the morning. Overall, there were some interesting paper presentations and discussions.

Second day of the conference

The second day was also a day of paper presentations.

Second day of the conference (afternoon)

The topics of the second day included itemset mining algorithms, inferring geo-information about persons, multigroup regression, analyzing the content of videos, time-series classification, gesture recognition (a presentation by Intel) and analyzing the evolution of communities in social networks.

I presented two papers that day (one of my own and one on behalf of a colleague), including a paper about high-utility itemset mining.

Third day of the conference

The third day of the conference also consisted of paper presentations, covering various topics such as image classification, image enhancement, mining patterns in cellular radio access network data, random forest learning, clustering and graph mining.

Conclusion

Overall, it was an interesting conference. I attended both the Industrial Conference on Data Mining and the MLDM conference this week. MLDM is more focused on theory, while the Industrial Conference on Data Mining is more focused on industrial applications. MLDM is a slightly bigger conference.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Big data, Conference, Data Mining, Data science | 1 Comment

Brief report about the 16th Industrial Conference on Data Mining 2016 (ICDM 2016)

In this blog post, I will provide a brief report about the 16th Industrial Conference on Data Mining 2016, which I attended from the 13th to the 14th of July 2016 in New York, USA.

About the conference
This is an established conference in the field of data mining, now at its 16th edition. The main proceedings are published by Springer, which ensures that they are well-indexed, while the poster proceedings are published by Ibai solutions.

The proceedings for full papers and posters

The acronym of this conference is ICDM (Industrial Conference on Data Mining). However, it should not be confused with IEEE ICDM (IEEE International Conference on Data Mining).

Acceptance rate

The quality of the papers at this conference was fine. 33 papers were accepted, and it was said during the conference that the acceptance rate was 33%. Interestingly, this conference attracts many papers related to industry applications, due to its name. There were some paper presentations from IBM Research and Mitsubishi.

Conference location

In the past, this conference has been held mostly in Germany, as the organizer is German. But this year, it was organized in the New York area, more specifically at a hotel in Newark, New Jersey, beside New York. Interestingly, the conference is co-located and co-organized with the MLDM conference. Hence, it is possible to spend 9 days in Newark and attend both conferences. You can read my report about MLDM 2016 here: MLDM 2016.

First day of the conference

The conference lasts two days (excluding workshops and tutorials, which require additional fees). The first day was scheduled to start at 9:00 and finish at around 18:00.

The conference room

The conference started with an opening ceremony. After the opening ceremony, there was supposed to be a keynote, but the keynote speaker unfortunately could not attend the conference due to personal reasons.

Then, there were paper presentations. The main topics of the papers were applications in medicine, algorithms for itemset mining, applying clustering to analyze data about divers, text mining, prediction of rainfall using deep learning, analyzing GPS traces of truck drivers, and preventing network attacks in the cloud using classifiers. I presented a paper about high-utility itemset mining and periodic pattern mining.

After the paper presentations, there was a poster session. One of the posters applied data mining to archaeology, which is a quite unusual application of data mining.

The poster session

Finally, there was a banquet at the same hotel. The food was fine. Since the conference is quite small, there were only three tables of 10 people. But a small conference also means a more welcoming atmosphere and more proximity between participants, so it was easy to talk with everyone.

Second day of the conference

The second day of the conference started at 9:00 and consisted of paper presentations, covering topics such as association rules, alarm monitoring, distributed data mining, process mining, diabetes age prediction, and data mining to analyze wine ratings.

The second day

Best paper award

One thing that I think is great about this conference is that there is a 500 € prize for the best paper; not many conferences offer a cash prize for the best paper award. There were three papers nominated for the award, and each nominee got a certificate. As one of my papers was nominated, I received a nice certificate (though I did not receive the best paper award).

Best paper nomination certificate

Conclusion

It was the first time that I attended this conference. It is a small conference compared to some others, and it is not a first-tier conference in data mining, but its proceedings are still published by Springer. At this conference, I did not meet anyone working exactly in my field, but I still had the opportunity to meet some interesting researchers. The MLDM conference, which I attended right after, was bigger.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Big data, Conference, Data Mining | 3 Comments

The top journals and conferences in data mining / data science

A key question for data mining and data science researchers is what the top journals and conferences in the field are, since it is always best to publish in the most popular venues. In this blog post, I will look at four different rankings of data mining journals and conferences, based on different criteria, and discuss them.

1) The Google Ranking of data mining and analysis journals and conferences

A first ranking is the Google Scholar Ranking (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis). This ranking is automatically generated based on the h5-index, which is described as “the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2011-2015 have at least h citations each“. The ranking of the top 20 conferences and journals is the following (a small code sketch showing how the h-index is computed is given after the table):

  Publication h5-index h5-median Type
1 ACM SIGKDD International Conference on Knowledge discovery and data mining 67 98 Conference
2 IEEE Transactions on Knowledge and Data Engineering 66 111 Journal
3 ACM International Conference on Web Search and Data Mining 58 94 Conference
4 IEEE International Conference on Data Mining (ICDM) 39 64 Conference
5 Knowledge and Information Systems (KAIS) 38 52 Journal
6 ACM Transactions on Intelligent Systems and Technology (TIST) 37 68 Journal
7 ACM Conference on Recommender Systems 35 64 Conference
8 SIAM International Conference on Data Mining (SDM) 35 54 Conference
9 Data Mining and Knowledge Discovery 33 57 Journal
10 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 30 56 Journal
11 European Conference on Machine Learning and Knowledge Discovery in Databases (PKDD) 30 36 Conference
12 Social Network Analysis and Mining 26 37 Journal
13 ACM Transactions on Knowledge Discovery from Data (TKDD) 23 39 Journal
14 International Conference on Artificial Intelligence and Statistics 23 29 Conference
15 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 22 30 Conference
16 IEEE International Conference on Big Data 18 30 Conference
17 Advances in Data Analysis and Classification 18 25 Journal
18 Statistical Analysis and Data Mining 17 30 Journal
19 BioData Mining 17 25 Journal
20 Intelligent Data Analysis 16 21 Journal
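
As a side note, the h-index underlying this ranking is easy to compute from a venue's citation counts. Here is a minimal Java sketch (my own illustration, not Google's code), where the input array is assumed to hold the citation counts of a venue's papers from the five-year window:

// Computes the h-index of an array of citation counts. The h5-index is
// simply the h-index restricted to papers from the last five complete years.
static int hIndex(int[] citations) {
    int[] sorted = citations.clone();
    java.util.Arrays.sort(sorted); // ascending order
    int h = 0;
    // Walk from the most-cited paper downward: h can grow as long as the
    // next paper has at least h + 1 citations.
    for (int i = sorted.length - 1; i >= 0 && sorted[i] >= h + 1; i--) {
        h++;
    }
    return h;
}

For example, hIndex(new int[]{10, 8, 5, 4, 3}) returns 4: four papers have at least 4 citations each, but there are not five papers with at least 5 citations each.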

 

Some interesting observations can be made from this ranking:

  • It shows that some conferences in the field of data mining actually have a higher impact than some journals. For example, the well-known KDD conference is ranked higher than all journals.
  • It appears strange that the KAIS journal is ranked higher than DMKD/DAMI and the TKDD journals, which are often regarded as better journals than KAIS. However, it may be that the field is evolving, and that KAIS has really improved over the years.

2) The Microsoft Ranking of data mining conferences

Another automatically generated ranking is the Microsoft ranking of data mining conferences (http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&orderby=1). This ranking is based on the number of publications and citations. Contrary to Google, Microsoft has separate rankings for conferences and journals.

Besides, Microsoft offers two metrics for ranking: the number of citations and the Field Rating. It is not very clear how the “Field Rating” is calculated by Microsoft. The Microsoft help center describes it as follows: “The Field Rating is similar to h-index in that it calculates the number of publications by an author and the distribution of citations to the publications. Field rating only calculates publications and citations within a specific field and shows the impact of the scholar or journal within that specific field“.

Here is the conference ranking of the top 30 conferences by citations:

Rank Conference Publications Citations
1 KDD – Knowledge Discovery and Data Mining 2063 69270
2 ICDE – International Conference on Data Engineering 4012 67386
3 CIKM – International Conference on Information and Knowledge Management 2636 28621
4 ICDM – IEEE International Conference on Data Mining 2506 18362
5 SDM – SIAM International Conference on Data Mining 708 9095
6 PKDD – Principles of Data Mining and Knowledge Discovery 994 8875
7 PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining 1255 6400
8 DASFAA – Database Systems for Advanced Applications 1251 4001
9 RIAO – Recherche d’Information Assistee par Ordinateur 574 3551
10 DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery 103 3264
11 DaWaK – Data Warehousing and Knowledge Discovery 503 2648
12 DS – Discovery Science 553 2256
13 Fuzzy Systems and Knowledge Discovery 4626 2171
14 DOLAP – International Workshop on Data Warehousing and OLAP 177 1830
15 IDEAL – Intelligent Data Engineering and Automated Learning 1032 1789
16 WSDM – Web Search and Data Mining 196 1499
17 GRC – IEEE International Conference on Granular Computing 1351 1434
18 ICWSM – International Conference on Weblogs and Social Media 238 1142
19 DMDW – Design and Management of Data Warehouses 70 993
20 FIMI – Workshop on Frequent Itemset Mining Implementations 32 849
21 MLDM – Machine Learning and Data Mining in Pattern Recognition 313 822
22 PJW – Workshop on Persistence and Java 41 712
23 ADMA – Advanced Data Mining and Applications 562 676
24 ICETET – International Conference on Emerging Trends in Engineering & Technology 712 376
25 WKDD – Workshop on Knowledge Discovery and Data Mining 527 342
26 KDID – International Workshop on Knowledge Discovery in Inductive Databases 70 328
27 ICDM – Industrial Conference on Data Mining 304 323
28 DMIN – Int. Conf. on Data Mining 434 278
29 MineNet – Mining Network Data 22 278
30 WebMine – Workshop on Web Mining 15 245

And here are the top 30 conferences by Field Rating:

Rank Conference Publications Field Rating
1 KDD – Knowledge Discovery and Data Mining 2063 122
2 ICDE – International Conference on Data Engineering 4012 104
3 CIKM – International Conference on Information and Knowledge Management 2636 67
4 ICDM – IEEE International Conference on Data Mining 2506 56
5 SDM – SIAM International Conference on Data Mining 708 45
6 PKDD – Principles of Data Mining and Knowledge Discovery 994 40
7 PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining 1255 33
8 RIAO – Recherche d’Information Assistee par Ordinateur 574 28
9 DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery 103 27
10 DASFAA – Database Systems for Advanced Applications 1251 26
11 DaWaK – Data Warehousing and Knowledge Discovery 503 22
12 DOLAP – International Workshop on Data Warehousing and OLAP 177 22
13 DS – Discovery Science 553 20
14 ICWSM – International Conference on Weblogs and Social Media 238 19
15 WSDM – Web Search and Data Mining 196 19
16 DMDW – Design and Management of Data Warehouses 70 19
17 PJW – Workshop on Persistence and Java 41 16
18 FIMI – Workshop on Frequent Itemset Mining Implementations 32 14
19 GRC – IEEE International Conference on Granular Computing 1351 13
20 IDEAL – Intelligent Data Engineering and Automated Learning 1032 13
21 MLDM – Machine Learning and Data Mining in Pattern Recognition 313 13
22 Fuzzy Systems and Knowledge Discovery 4626 11
23 ADMA – Advanced Data Mining and Applications 562 10
24 KDID – International Workshop on Knowledge Discovery in Inductive Databases 70 10
25 ICDM – Industrial Conference on Data Mining 304 9
26 MineNet – Mining Network Data 22 9
27 ESF Exploratory Workshops 17 8
28 TSDM – Temporal, Spatial, and Spatio-Temporal Data Mining 13 8
29 ICETET – International Conference on Emerging Trends in Engineering & Technology 712 7
30 WKDD – Workshop on Knowledge Discovery and Data Mining 527 7

Some observations:

  • The rankings by citations and by Field Rating are quite similar.
  • The KDD conference is still the #1 conference, which makes sense, and the CIKM, ICDM and SDM conferences are also among the top conferences in the field.
  • PKDD is ranked higher than PAKDD (as in the Google ranking), and both are ranked higher than DASFAA and DaWaK, and I agree with this.
  • Some conferences that were not in the Google ranking, like ICDE, appear here. It may be that Google puts the ICDE conference in a different category.
  • Microsoft ranks DMKD / DAMI as a conference, while it is a journal.
  • The FIMI workshop is also ranked high, although that workshop only took place in 2003 and 2004. Thus, it seems that Microsoft applies no restriction on time. Since the FIMI workshop has not been held since 2004, it arguably should not be in this ranking. The ranking would probably be better if Microsoft considered only the last five years, for example.

3) The Microsoft Ranking of data mining journals

Now let’s look at the top 20 data mining journals according to Microsoft, by citations.

Rank Journal Publications Citations
1 IPL – Information Processing Letters 7044 62746
 2 TKDE – IEEE Transactions on Knowledge and Data Engineering 2742 60945
3 CS&DA – Computational Statistics & Data Analysis 4524 24716
4 DATAMINE – Data Mining and Knowledge Discovery 584 19727
5 VLDB – The Vldb Journal 631 17785
6 Journal of Knowledge Management 747 9601
7 Sigkdd Explorations 491 9564
8 Journal of Classification 550 8041
9 KAIS – Knowledge and Information Systems 741 7639
10 WWW – World Wide Web 540 7182
11 INFFUS – Information Fusion 567 5617
12 IDA – Intelligent Data Analysis 477 4167
13 Transactions on Rough Sets 221 1653
14 JECR – Journal of Electronic Commerce Research 122 1577
15 TKDD – ACM Transactions on Knowledge Discovery From Data 110 716
16 IJDWM – International Journal of Data Warehousing and Mining 102 366
17 IJDMB – International Journal of Data Mining and Bioinformatics 132 256
18 IJBIDM – International Journal of Business Intelligence and Data Mining 124 251
19 Statistical Analysis and Data Mining 124 169
20 IJICT – International Journal of Information and Communication Technology 111 125

And here are the top 20 journals by Field Rating.

Rank Journal Publications Field Rating
1 TKDE – IEEE Transactions on Knowledge and Data Engineering 2742 109
 2 IPL – Information Processing Letters 7044 80
3 VLDB – The Vldb Journal 631 61
4 DATAMINE – Data Mining and Knowledge Discovery 584 57
5 Sigkdd Explorations 491 50
6 CS&DA – Computational Statistics & Data Analysis 4524 49
7 Journal of Knowledge Management 747 46
8 WWW – World Wide Web 540 42
9 Journal of Classification 550 37
10 INFFUS – Information Fusion 567 36
11 KAIS – Knowledge and Information Systems 741 33
12 IDA – Intelligent Data Analysis 477 28
13 JECR – Journal of Electronic Commerce Research 122 21
14 Transactions on Rough Sets 221 20
15 TKDD – ACM Transactions on Knowledge Discovery From Data 110 13
16 IJDMB – International Journal of Data Mining and Bioinformatics 132 8
17 IJDWM – International Journal of Data Warehousing and Mining 102 8
18 IJBIDM – International Journal of Business Intelligence and Data Mining 124 7
19 Statistical Analysis and Data Mining 124 7
20 IJICT – International Journal of Information and Communication Technology 111 5

Some observations:

  • The rankings by citations and by Field Rating are quite similar.
  • The TKDE journal is again at the top of the ranking, just like in the Google ranking.
  • It makes sense that the VLDB journal is quite high. This journal was not in the Google ranking, probably because it is more a database journal than a data mining journal.
  • SIGKDD Explorations is also a good journal, and it makes sense for it to be in the list. However, I’m not sure that it should be ranked higher than TKDD and DMKD / DAMI.
  • The KAIS journal is still ranked quite high. This time it is lower than DMKD / DAMI (unlike in the Google ranking) but still higher than TKDD, which is quite strange, as TKDD is arguably a better journal. As explained in the comment section of this blog post, a reason why KAIS is ranked so high may be that, in the past, the journal encouraged authors to cite papers from the KAIS journal. Besides, it appears that the Microsoft ranking has no restriction on time (it does not consider only the last five years, for example).
  • It is also quite strange that “Intelligent Data Analysis” is ranked higher than TKDD.
  • Some journals, like WWW and JECR, should perhaps not be in this ranking: although they publish data mining papers, they do not focus exclusively on data mining. This is probably why they are not in the Google ranking. Overall, the Microsoft ranking seems to be broader than the Google ranking.

4) Impact factor ranking

Now, another popular way of ranking journals is by their impact factor (IF). I have taken some of the top data mining journals above and obtained their impact factor from 2014/2015, or from 2013 when I could not find more recent information. Here is the result:

Journal Impact factor
DMKD/DAMI Data Mining and Knowledge Discovery 2.714
IEEE Transactions on Knowledge and Data Engineering 2.476
Knowledge and Information Systems 1.78
VLDB – The Vldb Journal 1.57
TKDD – ACM Transactions on Knowledge Discovery From Data 1.14
Advances in Data Analysis and Classification 1.03
Intelligent Data Analysis 0.50

Some observations:

  • TKDE and DAMI/DMKD are still among the top journals.
  • As in the Microsoft ranking, DAMI/DMKD is above KAIS, which is above TKDD.
  • As pointed out in the comment section of this blog post, it is strange that KAIS is ranked so high, compared for example to TKDD or VLDB, the latter being a first-tier database journal. This shows that the IF is not a perfect metric.
  • Compared to the Microsoft ranking, the IF ranking at least places the “Intelligent Data Analysis” journal much lower than TKDD. This makes sense, as TKDD is a better journal.

Conclusion

In this blog post, we have looked at four rankings of data mining journals and conferences: the Google ranking, the Microsoft rankings of conferences and of journals, and the Impact Factor ranking.

None of these rankings is perfect. They are somewhat accurate, but they may not always correspond to actual reputations in the data mining field. The Google ranking is more focused on the data mining field, while the Microsoft ranking is perhaps too broad and seems to have no restriction on time. Also, as can be seen by comparing these rankings, different measures yield different results. However, there are still some clear trends, such as TKDE being ranked as one of the top journals and KDD as the top conference in all rankings. The top journals and conferences are more or less the same in each ranking. But there are also some strange ranks, such as KAIS and Intelligent Data Analysis being ranked higher than TKDD in the Microsoft ranking.

Do you agree with these rankings?  Please leave your comments below!

Update 2016-07-19:  I have updated the blog post based on the insightful comments made by Jefrey in the comment section. Thanks!

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Big data, Data Mining, Data science, Research | 3 Comments

An introduction to periodic pattern mining

In this blog post, I will give an introduction to the discovery of periodic patterns in data. Mining periodic patterns is an important data mining task, as patterns may appear periodically in all kinds of data, and finding them can help to understand the data and to make strategic decisions.

For example, consider customer transactions made in retail stores. Analyzing the behavior of customers may reveal that some customers have periodic behaviors, such as buying certain products (e.g., wine and cheese) every weekend. Discovering these patterns may be useful to promote products on weekends, or to make other marketing decisions.

Another application of periodic pattern mining is stock market analysis. One may analyze the price fluctuations of stocks to uncover periodic changes in the market. For example, the stock price of a company may follow some patterns every month before it pays its dividend to its shareholders, or before the end of the year.

In the following, I will first present the problem of discovering frequent periodic patterns, i.e., periodic patterns that appear frequently. Then, I will discuss an extension of the problem, called periodic high-utility pattern mining, that aims at discovering profitable patterns that are periodic. Moreover, I will present open-source implementations of these algorithms.

The problem of mining frequent periodic patterns

The problem of discovering periodic patterns can be generally defined as follows. Consider the database of transactions depicted below. Each transaction is a set of items (symbols), and transactions are ordered by time.

A transaction database:

T1: {a, c}
T2: {e}
T3: {a, b, c, d, e}
T4: {b, c, d, e}
T5: {a, c, d}
T6: {a, c, e}
T7: {b, c, e}

Here, the database contains seven transactions, labeled T1 to T7. This database format can be used to represent all kinds of data. However, for our example, assume that it is a database of customer transactions in a retail store. The first transaction indicates that a customer has bought the items a and c together. For example, a could represent an apple, and c could represent cereals.

Given such a database of transactions, it is possible to find periodic patterns. The concept of periodic patterns is based on the notion of a period.

A period is the time elapsed between two occurrences of a pattern. It can be counted in terms of time, or in terms of a number of transactions. In the following, we will count the length of periods as a number of transactions. For example, consider the itemset (set of items) {a,c}. This itemset has five periods, illustrated below. The number annotating each period is the period length, calculated as a number of transactions.

The five periods of the itemset {a,c}

The first period of {a,c} is the time before the first occurrence of {a,c}. By definition, since {a,c} appears in the first transaction of the database, this period has a length of 1.

The second period of {a,c} is the gap between the first and second occurrences of {a,c}. The first occurrence is in transaction T1 and the second occurrence is in transaction T3. Thus, the length of this period is said to be 2 transactions.

The third period of {a,c} is the gap between the second and third occurrences of {a,c}. The second occurrence is in transaction T3 and the third occurrence is in transaction T5. Thus, the length of this period is said to be 2 transactions.

The fourth period of {a,c} is the gap between the third and fourth occurrences of {a,c}. The third occurrence is in transaction T5 and the fourth occurrence is in transaction T6. Thus, the length of this period is said to be 1 transaction.

Now, the fifth period is interesting. It is defined as the time elapsed between the last occurrence of {a,c} (in T6) and the last transaction in the database, which is T7. Thus, the length of this period is also 1 transaction.

Thus, in this example, the list of period lengths of the pattern {a,c} is: 1,2,2,1,1.
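
To make the notion of periods concrete, here is a minimal Java sketch (my own illustration, not the actual SPMF implementation) that computes the period lengths of an itemset, with the database represented as a list of transactions (sets of items):

import java.util.*;

public class PeriodSketch {

    // Computes the list of period lengths of an itemset in a transaction database.
    // By convention, the first period ends at the first occurrence of the itemset,
    // and the last period goes from its last occurrence to the end of the database.
    static List<Integer> periodLengths(List<Set<String>> database, Set<String> itemset) {
        List<Integer> periods = new ArrayList<>();
        int previous = 0; // position just before the first transaction
        for (int i = 1; i <= database.size(); i++) {
            if (database.get(i - 1).containsAll(itemset)) {
                periods.add(i - previous); // gap since the previous occurrence
                previous = i;
            }
        }
        periods.add(database.size() - previous); // final period, up to the last transaction
        return periods;
    }

    public static void main(String[] args) {
        List<Set<String>> database = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "c")),                // T1
                new HashSet<>(Arrays.asList("e")),                     // T2
                new HashSet<>(Arrays.asList("a", "b", "c", "d", "e")), // T3
                new HashSet<>(Arrays.asList("b", "c", "d", "e")),      // T4
                new HashSet<>(Arrays.asList("a", "c", "d")),           // T5
                new HashSet<>(Arrays.asList("a", "c", "e")),           // T6
                new HashSet<>(Arrays.asList("b", "c", "e")));          // T7
        List<Integer> periods = periodLengths(database, new HashSet<>(Arrays.asList("a", "c")));
        System.out.println(periods); // prints [1, 2, 2, 1, 1]
        double avg = periods.stream().mapToInt(Integer::intValue).average().orElse(0);
        System.out.println(avg);     // prints 1.4, the average periodicity of {a,c}
    }
}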

Several algorithms have been designed to discover periodic patterns in such databases, such as the PFP-Tree, MKTPP, ITL-Tree, PF-tree,  and MaxCPF algorithms. For these algorithms, a pattern is said to be periodic, if:
(1) it appears in at least minsup transactions, where minsup is a number of transactions set by the user,
(2) and the pattern has no period of length greater than a maxPer parameter also set by the user.

Thus, according to this definition, if we set minsup = 3 and maxPer = 2, the itemset {a,c} is said to be periodic because it has no period of length greater than 2 transactions, and it appears in at least 3 transactions.

This definition of a periodic pattern is, however, too strict. I will explain why with an example. Assume that maxPer is set to 3 transactions. Now, consider a pattern that appears every two transactions many times, but that just once reappears only after 4 transactions. Because the pattern has a single period greater than maxPer, it would automatically be deemed non-periodic, although it is in general periodic. Thus, this definition is too strict.

Thus, in a recent paper, we proposed a solution to this issue. We introduced two new measures, called the average periodicity and the minimum periodicity. The idea is that we should not discard a pattern because it has a single period that is too long, but should instead look at how periodic the pattern is on average. The designed algorithm is called PFPM. It discovers periodic patterns using a more flexible definition, where a pattern is said to be periodic if:
(1) the average length of its periods denoted as avgper(X) is not less than a parameter minAvg and not greater than a parameter maxAvg.
(2) the pattern has no period greater than a maximum maxPer.
(3) the pattern has no period smaller than a minimum minPer.

This definition of a periodic pattern is more flexible than the previous one, and thus lets the user better select the periodic patterns to be found. The user can set the minimum and maximum average periodicity as the main constraints for finding periodic patterns, and use the minimum and maximum periodicity as loose constraints to filter out patterns having periods that vary too widely.
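
The periodicity check itself is then simple. The following method is a sketch of the above definition (again my own illustration, not the actual PFPM code from SPMF); it could be added to the class from the previous sketch:

// Checks the flexible definition of a periodic pattern: every period must lie
// in [minPer, maxPer], and the average period length in [minAvg, maxAvg].
static boolean isPeriodic(List<Integer> periods, int minPer, int maxPer,
                          double minAvg, double maxAvg) {
    int sum = 0;
    for (int p : periods) {
        if (p < minPer || p > maxPer) {
            return false; // a period outside the allowed range
        }
        sum += p;
    }
    double avgPer = (double) sum / periods.size();
    return avgPer >= minAvg && avgPer <= maxAvg;
}

For instance, with the periods [1, 2, 2, 1, 1] of {a,c} and the parameters minPer = 1, maxPer = 3, minAvg = 1 and maxAvg = 2, the method returns true: all periods lie in [1, 3] and the average periodicity is 1.4.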

Based on the above definition, the problem of mining periodic patterns in a database is to find all the periodic patterns that satisfy the constraints set by the user. For example, if minPer = 1, maxPer = 3, minAvg = 1, and maxAvg = 2, the 11 periodic patterns found in the example database are the ones shown in the table below, which indicates the support (frequency) and the minimum, average and maximum periodicity of each pattern:

The 11 periodic patterns and their measures (the exact values appear in the algorithm output shown later in this post)

As can be observed in this example, the average periodicity gives a better view of how periodic a pattern is. For example, consider the patterns {a,c} and {e}. Both of these patterns have a largest period of 2 (called the maximum periodicity), and would be considered equally periodic using the standard definition of a periodic pattern. But their average periodicities are quite different: on average, {a,c} appears with a period of 1.4 transactions, while {e} appears with a period of about 1.17 transactions.

Discovering periodic patterns using the SPMF open-source data mining library

An implementation of the PFPM algorithm for discovering periodic patterns, including its source code, can be found in the SPMF data mining software. This software allows running the algorithm from a command line or a graphical interface. Moreover, the software can be used as a library, and the source code is provided under the GPL3 license. For the example discussed in this blog post, the input database is a text file encoded as follows:

3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

where the numbers 1, 2, 3, 4, 5 represent the letters a, b, c, d, e, and each line corresponds to one of the transactions T1 to T7. To run the algorithm from the source code, the following lines of code can be used:

// Input and output files, and the four periodicity thresholds
String input = "contextPFPM.txt";
String output = "output.txt";
int minPeriodicity = 1;
int maxPeriodicity = 3;
int minAveragePeriodicity = 1;
int maxAveragePeriodicity = 2;

// Applying the algorithm
AlgoPFPM algorithm = new AlgoPFPM();
algorithm.runAlgorithm(input, output, minPeriodicity, maxPeriodicity,
        minAveragePeriodicity, maxAveragePeriodicity);

The output is then a file containing all the periodic patterns shown in the table:

2 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
4 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
4 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
1 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
1 3 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
5 #SUP: 5 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.1666666666666667
5 3 #SUP: 4 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.4
3 #SUP: 6 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.0

For example, the eighth line represents the pattern {a,c}. It indicates that this pattern appears in four transactions, that its smallest and largest periods are respectively 1 and 2 transactions, and that its average periodicity is 1.4 transactions.

Related problems

Note that there also exist extensions of the problem of discovering periodic patterns. For example, another algorithm offered in the SPMF library, called PHM, is designed to discover “periodic high-utility itemsets” in customer transaction databases. The goal is not only to find patterns that appear periodically, but also to discover patterns that yield a high profit in terms of sales.

Conclusion

In this blog post, I have introduced the problem of discovering periodic patterns in databases. I have also explained how to use open-source software to discover periodic patterns. Mining periodic patterns is a general problem that may have many applications.

There are also many research opportunities related to periodic patterns, as this subject has not been extensively studied. For example, one possibility would be to extend the proposed algorithms into incremental or fuzzy algorithms, or to discover more complex types of periodic patterns such as periodic sequential rules.

Also, in this blog post, I have not discussed time series (sequences of numbers). Discovering periodic patterns in time series is another interesting problem, which requires different types of algorithms.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Big data, Data Mining, Data science, open-source, Research, Uncategorized, Utility Mining | 2 Comments

Full-time faculty positions at Harbin Institute of Technology (data mining, statistics, psychology, design…)

Full-time faculty positions at Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China

The Center of Innovative Industrial Design is currently looking to hire full-time faculty members at the rank of assistant professor, associate professor or professor, with expertise in one of the four following research directions:

RESEARCH DIRECTIONS

1: Industrial product innovation and user perception detection (HITSZ16-1601001)

Applicants should have a degree in computer science (e.g., artificial intelligence, data mining, machine learning), psychology (such as cognitive psychology, engineering psychology, quantitative research or psychological measurement), statistics, or another major related to quantitative research. The applicant should have a doctorate from a well-known university. If the applicant's doctorate is from a domestic (Chinese) university, overseas postdoctoral work experience is preferred. Applicants with the following professional or interdisciplinary backgrounds are preferred: physiological signal sensing and network construction, physiological and psychological parameter testing and modeling, human-computer interaction and virtual reality, data visualization design, natural language processing, cognitive science and neurology, etc. Candidates should have excellent scientific research capabilities and be fluent in English for teaching and communication.

2: Industrial Product Design of human-computer interface (HITSZ16-1601002)

Applicants should have a doctorate in mechanical or industrial design, computer science, ergonomics, digital media design, or a related field, obtained from a well-known university at home or abroad. If the doctorate was obtained from a domestic university, overseas postdoctoral work experience is preferred. The candidate should have research experience in human-computer interface design, including virtual or physical product interface design, and should be proficient in computer programming or in using digital product design software. The candidate should have research experience in user research, interface research, and product-related sociological research, be familiar with the academic frontiers of the human-computer interaction area, have excellent scientific research abilities, preferably have experience in interdisciplinary projects, and be fluent in English for teaching and communication.

3: Industrial Environmental Design and Digital Research (HITSZ16-1601003)

The candidate should hold a doctorate in design from a well-known university. If the candidate holds a doctorate from a domestic university, postdoctoral work experience abroad is preferred. The applicant should have interests in urban and industrial space design; architectural and spatial environment design; interior design for transport, equipment operation and other settings; and furniture, lighting, furnishing and display design. The applicant should have solid scientific research capabilities, be able to work in an interdisciplinary research environment, and be fluent in English for teaching and communication.

4: Industrial product modeling and visual and auditory communication research (HITSZ16-1601004)

The candidate should hold a doctorate in design, or a related professional doctorate in fine arts, from a well-known university at home or abroad. If the candidate holds a doctorate from a domestic university, postdoctoral work experience abroad is preferred. The candidate should have professional interests in audio-visual communication design and practice, industrial product design, web design, digital media interface design, etc. The candidate should have excellent scientific research capabilities, be able to work in an interdisciplinary environment, be fluent in English for teaching and communication, and have strong creative design capabilities.

JOB RESPONSIBILITIES

1. Teaching: The selected candidate will participate in the development of the discipline. S/he will develop and teach graduate and undergraduate courses, and supervise graduate students.

2. Research: The selected candidate should have a clear research direction, and be an active researcher capable of managing national, provincial, municipal, or business research projects. He/she should have publications such as monographs, textbooks, or papers in top journals as first author or corresponding author.

3. Discipline Construction: Work actively on discipline construction, and chair or participate in the discipline or teaching team.

4. Experiment: The candidate should actively participate in the development of the center.

5. Other: It is expected that the selected candidate will carry out international cooperation and academic exchange activities, will chair or participate in international academic conferences to expand the visibility and influence of the center and school, will actively participate in community service work, and will perform other tasks stipulated in his/her contract.

HOW TO APPLY

To apply for the position(s), the candidate should submit the following application materials:

1. Application Letter
a) self-evaluation of professional quality, ability, and character
b) personal statement about research and teaching ability
c) work plan for this position
2. The Teaching and Research Job Application Form
3. CV in English and Chinese (if possible)

IMPORTANT NOTE: Please email the contacts below with application materials as attachments and specify the applied position ID + your name.

Human resource contact person:

Ms. Wang, e-mail: hr@hitsz.edu.cn.

Innovative Industrial Design Research Center Contact:
Dr. Yao, e-mail: julie_j_yao@163.com

Posted in artificial intelligence, Big data, Data Mining, Data science, Research | Leave a comment

An Introduction to Sequence Prediction

In this blog post, I will give an introduction to the task of sequence prediction, a popular data mining / machine learning task, which consists of predicting the next symbol of a sequence of symbols. This task is important, as it has many real-life applications such as webpage prefetching and product recommendation.

What is a sequence?

Before defining the problem of sequence prediction, it is necessary to first explain what a sequence is. A sequence is an ordered list of symbols. For example, here are some common types of sequences:

  • A sequence of webpages visited by a user, ordered by the time of access.
  • A sequence of words or characters typed on a cellphone by a user, or in a text such as a book.
  • A sequence of products bought by a customer in a retail store
  • A sequence of proteins in bioinformatics
  • A sequence of symptoms observed on a patient at a hospital

Note that in the above definition, we consider that a sequence is a list of symbols and does not contain numeric values. A sequence of numeric values is usually called a time series rather than a sequence, and the task of predicting a time series is called time-series forecasting. But that is another topic.

What is sequence prediction?

The task of sequence prediction consists of predicting the next symbol of a sequence based on the previously observed symbols. For example, if a user has visited some webpages A, B, C, in that order, one may want to predict the next webpage that this user will visit, in order to prefetch it.

An illustration of the problem of sequence prediction

There are two steps to perform sequence prediction:

  1. First,  one must train a sequence prediction model using some previously seen sequences called the training sequences.  This process is illustrated below:

    For example, one could train a sequence prediction model for webpage prediction using the sequences of webpages visited by several users.

  2. The second step is to use a trained sequence prediction model to perform prediction for new sequences (i.e. predict the next symbol of a new sequence), as illustrated below.

    For example, using a prediction model trained with the sequences of webpages visited by several users, one may predict the next webpage visited by a new user.

An overview of state-of-the-art sequence prediction models

Having defined the task of sequence prediction, what are the main sequence prediction models that could be used in an application?

Numerous models have actually been proposed by researchers, such as DG, All-k-order Markov, TDAG, PPM, CPT and CPT+. These models rely on various approaches: some of them use, for example, neural networks, pattern mining, or probabilistic approaches.

How to determine if a sequence prediction model is good?

Different sequence prediction models have different advantages and limitations, and may perform more or less well on different types of data. Typically, a sequence prediction model is evaluated in terms of criteria such as its prediction accuracy, the memory that it uses, and the execution time for training and for performing predictions.
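
As an illustration of how accuracy is measured, a common protocol is to hide the last symbol of each test sequence and to check whether the model can recover it. Here is a minimal sketch of such an evaluation, using the hypothetical SimpleMarkovPredictor sketched above:

    import java.util.Arrays;

    // Accuracy: the fraction of test sequences for which the hidden last symbol
    // is correctly predicted by the model
    public static double accuracy(SimpleMarkovPredictor model, int[][] testSequences) {
        int correct = 0;
        int total = 0;
        for (int[] seq : testSequences) {
            if (seq.length < 2) {
                continue; // we need at least one symbol to predict from
            }
            int[] prefix = Arrays.copyOf(seq, seq.length - 1); // hide the last symbol
            if (model.predict(prefix) == seq[seq.length - 1]) {
                correct++;
            }
            total++;
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

Real benchmarks apply this kind of protocol to large datasets, and also measure the memory and the runtime of each model.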

Several benchmarks have been done in the literature to compare the various models. For example, here is a recent benchmark performed by my team in our PAKDD 2015 paper about sequence prediction with CPT+:

[Figure: prediction accuracy of various models on benchmark datasets]

In this benchmark, we compared our proposed CPT+ sequence prediction model with several state-of-the-art models on various types of data. Briefly, BMS, MSNBC, Kosarak and FIFA are datasets of sequences of webpages. SIGN is a sign-language dataset. Bible word and Bible char are datasets of sequences of words and of characters, respectively. As can be seen, for these types of data at least, CPT+ greatly outperforms the other models. There are several reasons. One of them is that several models such as DG assume the Markovian hypothesis that the next symbol only depends on the last symbol(s) of a sequence. Another reason is that the CPT+ model uses an efficient indexing approach to consider all the relevant data for each prediction (see the paper for details).

Where can I get open-source implementations of sequence prediction models?

Open-source Java implementations of seven sequence prediction models (DG, AKOM, TDAG, LZ78, PPM, CPT, and CPT+) can be found in the SPMF open-source data mining library, which includes the implementations from the IPredict project.

There is extensive documentation about how to use these models on the SPMF website. Here, I will provide a quick example that shows how the CPT model can be applied with just a few lines of code.
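
Before running this example, a file of training sequences in the SPMF format is needed. If I remember correctly, in this format each line encodes one sequence, -1 acts as a separator and -2 marks the end of the sequence (please check the SPMF documentation for the exact format). For instance, a hypothetical training_sequences.txt could look like this:

    1 -1 4 -1 2 -1 -2
    1 -1 4 -1 3 -1 -2
    1 -1 4 -1 2 -1 5 -1 -2

The Java code is then as follows: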

    // Phase 1: Training
    // Load a file containing the training sequences into memory
    SequenceDatabase trainingSet = new SequenceDatabase();
    trainingSet.loadFileSPMFFormat("training_sequences.txt", Integer.MAX_VALUE, 0, Integer.MAX_VALUE);

    // Train the prediction model
    String optionalParameters = "splitLength:6 recursiveDividerMin:1 recursiveDividerMax:5";
    CPTPredictor predictionModel = new CPTPredictor("CPT", optionalParameters);
    predictionModel.Train(trainingSet.getSequences());

    // Phase 2: Sequence prediction
    // We will predict the next symbol after the sequence <(1),(4)>
    Sequence sequence = new Sequence(0);
    sequence.addItem(new Item(1));
    sequence.addItem(new Item(4));
    Sequence thePrediction = predictionModel.Predict(sequence);
    System.out.println("For the sequence <(1),(4)>, the prediction for the next symbol is: " + thePrediction);

Thus, without going into the details of each prediction model, it can be seen that it is very easy to train a sequence prediction model and then to use it to perform predictions.

Conclusion

This blog post has introduced the task of sequence prediction, which has many applications. Furthermore, open-source implementations of sequence prediction models have been presented.

==
Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.


Six important skills to become a successful researcher

Today, I will discuss how to become a good researcher and the most important skills that a researcher should have. This blog post is aimed at young Master's degree students and Ph.D. students, to provide them with some useful advice.

1) Being humble and open to criticism

An important skill for being a good researcher is to be humble and to be able to listen to others. Even when a researcher works very hard and thinks that his/her project is “perfect”, there are always some flaws or some possibilities for improvement.

A humble researcher will listen to the feedback and opinions of other researchers on his/her work, whether this feedback is positive or negative, and will think about how to use this feedback to improve the work. A researcher that works alone can do excellent work. But by discussing research with others, it is possible to get new ideas. Also, when a researcher presents his/her work to others, it is possible to better understand how people will view the work. For example, other people may misunderstand the work because something is unclear. Thus, the researcher may need to make adjustments to his/her research project.

2) Building a social network

A second important thing for young researchers to work on is building a social network. If a researcher has the opportunity to attend international conferences, s/he should try to meet other students and professors to establish contact with other researchers. Other ways of establishing contact are to send e-mails to ask questions or discuss research, or to attend seminars at other universities at a regional or national level.


Building a social network is very important as it can create many opportunities for collaboration. Besides, it can be useful for obtaining a Ph.D. position at another university (or abroad), a post-doctoral research position, or even a lecturer or professor position in the future, or for obtaining other benefits such as being invited to give a talk at another university or being part of the program committee of conferences and workshops. A young researcher often has to work by himself/herself, but s/he should also try to connect with other researchers.

For example, during my Ph.D. in Canada, I established contact with some researchers in Taiwan, and I then applied there to do my postdoc. More recently, I used some other contacts to find a professor position in China, where I applied and got the job. Also, I have carried out many collaborations with researchers that I have met at conferences.

3) Working hard, working smart

To become a good researcher, another important skill is to spend enough time on your project. In other words, a successful researcher will work hard. For example, it is quite common that good researchers work more than 10 hours a day. But of course, it is not just about working hard, but also about working “smart”: a researcher should spend each minute of his/her time doing something useful that advances him/her toward his/her goals. Thus, working hard should also be combined with good planning.


When I was an MSc and Ph.D. student, I could easily work more than 12 hours a day. Sometimes, I would only take a few days off during the whole year. Currently, I still work very hard every day, but I have to take a little bit more time off due to having a family. However, I have gained in efficiency. Thus, even by working a bit less, I can be much more productive than I was a few years ago.

4) Having clear goals / being organized / having a good research plan

A researcher should also have clear goals. For a Ph.D. or MSc student, this includes the general goal of completing the thesis, but also some subgoals or milestones for attaining this main goal. One should also try to set dates for achieving these goals. In particular, a student should think about planning his/her work in terms of deadlines for conferences. It is not always easy to plan well, but it is a skill that one should try to develop. Finally, one should also choose his/her research topic(s) well, to work on meaningful topics that will lead to a good research contribution.

5) Stepping out of the comfort zone

A young researcher should not be afraid to step out of his/her comfort zone. This includes trying to meet other researchers, trying to establish collaborations, trying to learn new ideas or explore new and difficult topics, and also studying abroad.


For example, after finishing my Ph.D. in Canada, which was mostly related to e-learning, I decided to work on the design of fundamental data mining algorithms for my post-doctoral studies, and to do this in a data mining lab in Taiwan. This was a major change both in terms of research area and in terms of country. It helped me to build new connections and to work in a more popular research area, so as to have a better chance of obtaining a professor position thereafter. This was risky, but I successfully made the transition. Then, after my postdoc, I got a professor job in Canada at a university far away from my hometown. This was a compromise that I had to make to get a professor position, since there are very few professor positions available in Canada (maybe only 5 that I could apply for every year). Then, after working as a professor in Canada for 4 years, I decided to take another major step out of my comfort zone by selling my house and accepting a professor job at a top 9 university in China. This last move was very risky, as I quit my good job in Canada, where I was about to obtain a permanent position. Moreover, I did that before I had actually signed the papers for my job in China. Also, from a financial perspective, I lost more than $20,000 by selling my house quickly to move out. However, the move to China has paid off, as in the following months I was selected by a national program for young talents in China. Thus, I now receive about 10 times the research funding that I had in Canada, and my salary is more than twice my salary as a professor in Canada, thus covering all the money that I had lost by selling my house. Besides, I have been promoted to full professor and will lead a research center. This is an example of how one can create opportunities in one's career by taking risks.

6) Having good writing skills

A young researcher should also try to improve his/her writing skills. This is very important for all researchers, because a researcher will have to write many publications during his/her career. Every minute that one spends on improving writing skills will pay off sooner or later.

In terms of writing skills, there are two types of skills.

  • First, one should be good at writing in English without grammar and spelling errors.
  • Second, one should be able to organize his/her ideas clearly and write a well-organized paper (no matter whether it is written in English or another language). For students, it is important to work on improving these two skills during their MSc and Ph.D. studies.

These skills are acquired by writing and reading papers, and by spending the time to improve oneself when writing (for example, by checking grammar rules when unsure about grammar).

Personally, I am not a native English speaker. I have thus worked hard during my graduate studies to improve my English writing skills.

Conclusion

In this brief blog post, I gave some general advice about important skills for becoming a successful researcher. If you think that I have forgotten something, please post it in the comments below.

==
Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.


Finding a Data Scientist Unicorn or building a Data Science Team?

In recent months/years, many blog posts have been trending on the social Web about what a “data scientist” is, as this term has become very popular. As there is much hype around this term, some people have even jokingly said that a “data scientist is a statistician who lives in San Francisco”.

In this blog post, I will talk about this recent discussion about what a data scientist is, which has led some people to claim that there are some easy signs to recognize a bad data scientist or a fake data scientist. In particular, I will discuss the blog post “10 signs of a bad data scientists” and explain why this discussion is going the wrong way.

10 signs of a bad data scientists

In this blog post, the authors claim that a data scientist must be good at math/statistics, good at coding, good at business, and must know most of the tools, from Spark, Scala, Python and SAS to Matlab.

What is wrong with this? The obvious problem is that it is not really necessary to know all these technologies to analyze data. For example, a person may never have to use Spark to analyze data, and will rarely use all these technologies in the same environment. But more importantly, this blog post seems to imply that a single person should replace a team of three persons: (1) a coder with data mining experience, (2) a statistician, and (3) someone good at business. Do we really need a person that can replace these three persons? The problem with this idea is that a person will always be stronger in one of these three dimensions and weaker in the two others. A person that possesses skills in all three dimensions, and is moreover excellent in all three of them, is quite rare. Hence, I here call such a person the data scientist unicorn: a person so skilled that he/she can replace a whole team.

[Figure: the data scientist unicorn]

In my opinion, instead of thinking about finding that unicorn, the discussion should rather be about building a good data science team, consisting of three appropriate persons who are respectively good at statistics, computer science, and business, and who also have enough background/experience to be able to discuss with the other team members. Thus, perhaps we should move the discussion from what is a good data scientist to what is a good data science team.

[Figure: a data science team]

An example

I will now discuss my own case as an example to illustrate the point that I am trying to make. I am a researcher in data mining. I have a background in computer science, and I have worked for 8 years on designing highly efficient data mining algorithms to analyze data. I am very good at this (I am even the founder of the popular Java SPMF data mining library). But I am less good at statistics, and I have almost no knowledge about business.

But this is fine, because I am an expert at what I am doing, in one of these three dimensions, and I can still collaborate with a statistician or someone good at business when I need to. I should not replace a statistician. And it would be wrong to ask a statistician to do my job of designing highly efficient algorithms, as it requires many years of programming experience and excellent knowledge of algorithmics.

A risk with the vision of the “data scientist unicorn” that is good at everything is that such a person may not be an expert at any of those things.

Perhaps a solution for training good data scientists is those new “data science” degrees that aim at teaching a little bit of everything. I will not say whether these degrees are good or not, as I have not looked at these programs. But there is always the risk of training people who can do everything but are not expert at anything. Thus, why not instead try to build a strong data science team?

==

Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms.

If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
