This brief blog post discusses what not to do when applying for an M.Sc. or Ph.D. position in a research lab. The aim of this post is to give advice to those applying for such positions.
I previously discussed this topic in another blog post, where I explained that it is important to send personalized e-mails to potential supervisors rather than sending the same e-mail to many professors. I will thus not explain that again.
In this post, I will rather emphasize another important aspect, which is giving a good impression of yourself to other people. I will discuss this using an e-mail that I received today:
From *****@***.sa
Subject: Apply For Scholars ship Ph.d
Sir Philippe Fournier IHOPE TO APPLY MY TO YOR PROGRAM OF DO PH.D ON COMPUTER SCIENCE ,I HAVE READ THAT YOU OFFER PH.D PROGRAM ON COMPUTER SCINCE AS GENERAL MY TOPIC IS WEP APPLICATION SECIRTY ALL THAT .I HAVE BE IN PAKISTAN FOR SEVEN YERS IDID MASTER THERE SO MY PROJECT WAS WEB SIDE FOR SINDH UNIVERSITY FROM THAT DATE 2010-7-21 IDID TWO PAPER PUBLICATION .WITH ALL OF INTERESTED TO WORK WITH YOU HOPE TO APPLY MY ON YOUR PROGRAMS. THANKS FOR YOR TIMEM
A person sending this type of e-mail has zero chance of getting a position in my team. Why?
It is poorly written. There are many typos and English errors. If this person cannot take the time to write an e-mail properly, it gives the impression that this person is careless and would do a bad job.
The person did not take the time to spell my full name correctly. Not spelling the name properly shows a lack of respect or carelessness. This is something to absolutely avoid.
The person asks to work on web security. I have never published a single paper on that topic. Thus, I would not hire that person. This person should contact a professor working on such topics.
The applicant does not provide his CV and gives very little information about himself. He mentions that he has two publications, but that does not mean anything if I don't know where they were published; there are so many bad conferences and journals. An applicant should always send his CV, to avoid e-mails going back and forth to obtain this information. I will often not answer an e-mail if no CV is attached, simply because it wastes time to ask for a CV that should have been provided. Besides, when a CV is provided, it should be detailed enough. It is even better when a student can provide transcripts showing his grades from previous studies.
The applicant does not really explain why he wants to work with me or how he found my profile. For a master's degree, this is not so important. But when applying for a Ph.D., it is expected that the applicant will choose his supervisor for a reason such as common research interests (as I have discussed above).
That is all I wanted to write for today.
If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!
This tutorial will explain how to analyze text documents to discover complex and hidden relationships between words. We will illustrate this with a Sherlock Holmes novel. Moreover, we will explain how hidden patterns in text can be used to recognize the author of a text.
The Java open-source SPMF data mining library will be used in this tutorial. It is a library designed to discover patterns in various types of data, including sequences, and it can also be used as a standalone software. Handling text documents is a new feature of the most recent release of SPMF (v.2.01).
Obtaining a text document to analyze
The first step of this tutorial is to obtain a text document to analyze. A simple way of obtaining text documents is to go to the website of Project Gutenberg, which offers numerous public domain books. I have chosen the novel "Sherlock Holmes: The Hound of the Baskervilles" by Arthur Conan Doyle. For the purpose of this tutorial, the book can be downloaded here as a single text file: SHERLOCK.text. Note that I have quickly edited the book to remove unnecessary information, such as the table of contents, that is not relevant for our analysis. Moreover, I have renamed the file so that it has the extension ".text", so that SPMF recognizes it as a text document.
Downloading and running the SPMF software
The first task that we will perform is to find the k most frequent sequences of words in the text. We will first download the SPMF software from the SPMF website by going to the download page. On that webpage, there are detailed instructions explaining how the software can be installed. But for the purpose of this tutorial, we will directly download spmf.jar, which is the version of the library that can be used as a standalone program with a user interface.
Now, assuming that you have Java installed on your computer, you can double-click on spmf.jar to launch the software. This will open a window like this:
Discovering hidden patterns in the text document
Now, we will use the software to discover hidden patterns in the Sherlock Holmes novel. There are many algorithms that could be applied to find patterns. We will choose the TKS algorithm, which is an algorithm for finding the k most frequent subsequences in a set of sequences. In our case, a sequence is a sentence. Thus, we will find the k most frequent sequences of words in the novel. Technically, this type of pattern is called a skip-gram. Discovering the most frequent skip-grams is done as follows.
A) Finding the K most frequent sequences of words (skip-grams)
We will choose the TKS algorithm
We will choose the file SHERLOCK.text as input file
We will enter the name test.txt as output file for storing the result
We will set the parameter k of this algorithm to 10 to find the 10 most frequent sequences of words.
We will click the “Run algorithm” button.
The result is a text file containing the 10 most frequent patterns
The first line, for example, indicates that the word "the" is followed by "of" in 762 sentences of the novel. The second line indicates that the word "in" appears in 773 sentences. The third line indicates that the word "the" is followed by "the" in 869 sentences. And so on. Thus, we will next change the parameters to find consecutive sequences of words.
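To make these support counts concrete, here is a small Python sketch of what a skip-gram's support means: a pattern is counted once per sentence in which its words appear in order, possibly with other words in between. This is only an illustration of the measure, not SPMF's TKS implementation, which uses a much more efficient search:

```python
def is_subsequence(pattern, sentence):
    """True if the words of `pattern` appear in `sentence` in order,
    possibly with gaps (the skip-gram containment test)."""
    it = iter(sentence)
    return all(word in it for word in pattern)

def support(pattern, sentences):
    """Number of sentences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sentences)

sentences = [
    ["the", "hound", "of", "the", "baskervilles"],
    ["the", "moor", "was", "silent"],
    ["a", "man", "of", "science"],
]
print(support(["the", "of"], sentences))  # "the" followed by "of" in 1 sentence
```

This also explains why a pattern like "the the" can be frequent: the two occurrences of "the" do not need to be adjacent in the sentence.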
B) Finding the K most frequent consecutive sequences of words (ngrams)
The above patterns are not so interesting because most of them are very short. To find more interesting patterns, we will set the minimum pattern length to 4. Moreover, another problem is that some patterns such as "the the" contain gaps between words. Thus, we will also specify that we do not want gaps between words by setting the max gap constraint to 1. Finally, we will increase the number of patterns to 25. This is done as follows:
We set the number of patterns to 25
We set the minimum length of patterns to 4 words
We require that there is no gap between words (max gap = 1)
We will click the “Run algorithm” button.
The result is the following patterns:
Now this is much more interesting. It shows some sequences of words that the author of Sherlock Holmes tends to use repeatedly. The most frequent 4-word sequence is "in front of us", which appears 13 times in this story.
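When the max gap is set to 1, the patterns are contiguous n-grams. What is being counted can be sketched in a few lines of Python (again, just an illustration of the measure, not the actual algorithm):

```python
from collections import Counter

def ngram_supports(sentences, n):
    """For each contiguous n-word sequence, count the number of
    sentences in which it appears at least once."""
    counts = Counter()
    for sentence in sentences:
        # use a set so a gram is counted at most once per sentence
        grams = {tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)}
        counts.update(grams)
    return counts

sentences = [
    ["in", "front", "of", "us", "stood", "the", "hall"],
    ["a", "figure", "appeared", "in", "front", "of", "us"],
]
for gram, sup in ngram_supports(sentences, 4).most_common(2):
    print(" ".join(gram), sup)
```

Here "in front of us" would get a support of 2, since it appears in both sentences.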
It would be possible to further adjust the parameters to find other types of patterns. For example, using SPMF, it is also possible to find all patterns having a frequency higher than a given threshold. This can be done, for example, with the CM-SPAM algorithm. Let's try this.
C) Finding all sequences of words appearing frequently
We will use the CM-SPAM algorithm to find all patterns of at least 2 words that appear in at least 1 % of the sentences in the text. This is done as follows:
We choose the CM-SPAM algorithm
We set the minimum frequency to 1 % of the sentences in the text
We require that patterns contain at least two words
We will click the “Run algorithm” button.
The result is 113 patterns. Here are a few of them:
There are some quite interesting patterns here. For example, we can see that "Sherlock Holmes" is a frequent pattern appearing 31 times in the text, and that "sir Charles" is actually more frequent than "Sherlock Holmes". Other patterns are also interesting and give some insight into the writing habits of the author of this novel.
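Internally, a relative threshold like 1 % of the sentences corresponds to a minimum absolute number of sentences. As a small illustration of this conversion (this is not CM-SPAM code, and the function name is just for illustration):

```python
def min_support_count(num_sentences, percent):
    """Smallest number of sentences a pattern must appear in to satisfy
    a relative support threshold given as an integer percentage.
    Computed as the ceiling of num_sentences * percent / 100, using
    exact integer arithmetic to avoid floating-point surprises."""
    return -(-num_sentences * percent // 100)

# A novel with about 3,000 sentences and a 1 % threshold:
print(min_support_count(3000, 1))  # -> 30
```

So with a 1 % threshold, a pattern such as "Sherlock Holmes" must appear in roughly 30 sentences to be reported.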
Now let’s try another type of patterns.
D) Finding sequential rules between words
We will now try to find sequential rules. A sequential rule X -> Y is a sequential relationship between two unordered sets of words appearing in the same sentence. For example, we can apply the ERMiner algorithm to discover sequential rules between words and see what kind of results can be obtained. This is done as follows:
We choose the ERMiner algorithm
We set the minimum frequency to 1 % of the sentences in the text
We require that rules have a confidence of at least 80% (a rule X -> Y has a confidence of 80% if the unordered set of words X is followed by the unordered set of words Y in at least 80% of the sentences in which X appears)
We will click the “Run algorithm” button.
The result is a set of three rules.
The first rule indicates that 96% of the time, when "Sherlock" appears in a sentence, it is followed by "Holmes", and that "Sherlock Holmes" appeared a total of 31 times in the text.
For this example, I have chosen the parameters so as not to obtain too many rules. But it is possible to change the parameters to obtain more rules, for example by changing the minimum confidence requirement.
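The confidence of such a rule is simply a ratio of support counts. Here is a simplified single-word sketch of the measure (ERMiner handles unordered sets of words and explores the rule space efficiently; this only illustrates the definition):

```python
def confidence(x, y, sentences):
    """Confidence of the rule {x} -> {y}: among the sentences containing
    the word x, the fraction in which y appears after the first
    occurrence of x (a simplified reading of the sequential-rule
    definition)."""
    containing_x = 0
    followed_by_y = 0
    for sentence in sentences:
        if x in sentence:
            containing_x += 1
            if y in sentence[sentence.index(x) + 1:]:
                followed_by_y += 1
    return followed_by_y / containing_x if containing_x else 0.0

sentences = [
    ["sherlock", "holmes", "smiled"],
    ["said", "sherlock", "holmes"],
    ["sherlock", "was", "silent"],
]
print(confidence("sherlock", "holmes", sentences))  # 2 of 3 sentences -> ~0.67
```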
Applications of discovering patterns in texts
Here we have shown how various types of patterns can be easily extracted from text files using the SPMF software. The goal was to give an overview of some types of patterns that can be extracted. There are also other algorithms offered in SPMF, which could be used to find other types of patterns.
Now let’s talk about the applications of finding patterns in text. One popular application is called “authorship attribution”. It consists of extracting patterns from a text to learn about the writing style of an author. The patterns can then be used to automatically guess the author of an anonymous text.
For example, if we have a set of texts written by various authors, it is possible to extract the most frequent patterns in each text to build a signature representing each author’s writing style. Then, to guess the author of an anonymous text, we can compare the patterns found in the anonymous text with the signature of each author to find the most similar one. Several papers have been published on this topic. Besides using words for authorship attribution, it is also possible to analyze the part-of-speech tags in a text. This requires first transforming the text into sequences of part-of-speech tags. I will not show how to do this in this tutorial, but it is the topic of a few papers that I have recently published with my students. We have also done this with the SPMF software. If you are curious and want to know more about this, you may look at my following paper:
There are also other possible applications of finding patterns in text such as plagiarism detection.
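Coming back to authorship attribution, the signature-comparison step described above can be sketched as follows, where an author's signature is modeled as a set of frequent patterns and similarity is measured with the Jaccard index. This is a deliberately simplified illustration; real authorship-attribution systems use richer features and weighting schemes:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of patterns."""
    return len(a & b) / len(a | b) if a | b else 0.0

def guess_author(anonymous_patterns, signatures):
    """Return the author whose pattern signature is most similar
    to the patterns extracted from the anonymous text."""
    return max(signatures,
               key=lambda author: jaccard(anonymous_patterns, signatures[author]))

# Hypothetical signatures built from frequent patterns of known texts:
signatures = {
    "doyle": {("in", "front", "of", "us"), ("sherlock", "holmes"), ("upon", "the", "moor")},
    "austen": {("in", "want", "of"), ("it", "is", "a", "truth"), ("mr", "darcy")},
}
anonymous = {("sherlock", "holmes"), ("upon", "the", "moor"), ("the", "hound")}
print(guess_author(anonymous, signatures))  # -> doyle
```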
Conclusion
In this blog post, I have shown how open-source software can be used to easily find patterns in text. The SPMF library can be used as a standalone program or called from other Java programs. It offers many algorithms with several parameters to find various types of patterns. I hope that you have enjoyed this tutorial. If you have any comments, please leave them below.
== Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my Twitter account @philfv to get notified about new posts.
One and a half years ago, I was working as a professor at a university in Canada. But I decided not to renew my contract and to move to China. At that time, some people may have thought that I was crazy to leave my job in Canada, since it was an excellent job, and I also had a house and a car. So why go somewhere else? However, as of today, I can tell you that moving to China has been one of the best decisions that I ever took for my career. In this blog post, I will tell you my story and explain why I moved there. I will also compare the working conditions that I had in Canada with those that I have now in China.
Before moving to China
After finishing my Ph.D. in 2010, I worked as a post-doctoral researcher in Taiwan for a year. Then, I came back to Canada and worked as a faculty member for about 4 years. However, in Canada, faculty positions are very rare. When I was in Canada, I was hoping to move to another faculty position closer to my hometown, to be closer to my family, but this was almost impossible, since there were only about five faculty positions related to my research area in computer science that I could apply to, every year, in the whole country! Thus, getting a faculty position in Canada is extremely difficult and competitive. There are tons of people applying and very few positions available.
I had several interviews at various universities in Canada. But getting a faculty position at another university in Canada was hard for various reasons. Sometimes a job is announced but the committee already has someone in mind, or prefers other candidates for various strange reasons. For example, at the last interview that I had in Canada, about two years ago at a university in Quebec, they basically hired someone else who had almost no research experience, due to some “political reasons”. Just to give you a sense of how biased that hiring process was, here is a comparison of the candidate that was hired and me:
Total number of citations: < 150 (the selected candidate) vs. 1031 (me)
Number of citations (most cited paper): < 20 (the selected candidate) vs. 134 (me)
Number of citations (last year): < 30 (the selected candidate) vs. >300 (me)
Number of papers (this year): 4 (the selected candidate) vs. >40 (me)
So who would you hire? Anyway, I just show this as an example of how the hiring process is not always fair. Actually, this could have happened anywhere in the world. But when there are very few jobs available, as in Canada, it makes it even harder to find a position. In any case, it does not bother me, since this has led me to try something else and move to China, which has been one of the best decisions for my career!
Before explaining what happened after this, let me make it clear that I did not leave my previous job in Canada because I did not like it. Actually, I had the chance to work at a great university in Canada, where I made many friends and also had some wonderful students. I had my first opportunity to work as a professor there, and it was a hard decision to leave. However, to go further in my career as a researcher, I wanted to move to a bigger university.
Moving to China
Thus, at the end of June 2015, I decided to apply for a faculty position at a top university in China. I quickly passed the interview and started to work there a few months later, after quickly selling my house and my car in Canada. So now let’s talk about what you probably want to know: how does my current job in China compare to my previous position in Canada?
Well, I must first say that I moved to one of the top 10 universities in China, which is also one of the top 100 universities in the world for computer science. Thus, the level of the students is quite high, and it is also an excellent research environment. But let’s analyze this in more detail.
In terms of research funding:
In Canada, it has become extremely difficult to receive research funding, due to budget cuts and the lack of major investment in education by the government. To give you an idea, the main research funding agency, called NSERC, could only give me $100,000 CAD over five years, and I was considered lucky to have this funding. But this is barely enough to pay one graduate student and attend one or two conferences per year.
In China, on the other hand, the government offers incredible research funding opportunities. Of course, not everyone is equally funded. The smaller universities do not receive as much funding as the top universities. But there are some very good research programs to support researchers, especially the top researchers. In my case, I applied for a program to recruit talents run by the NSFC (National Natural Science Foundation of China). I was awarded 4,000,000 RMB in research funding (about $800,000 CAD) over five years. Thus, I now receive about eight times more funding for my research than I received in Canada. This of course makes a huge difference. I can now buy expensive equipment that I need, such as a big data cluster, hire a post-doc, pay many students, and perhaps even eventually hire a professional programmer to support my research. Besides, after getting this grant for young talents, I was automatically promoted to full professor, will soon become the director of a research center, and will get my own lab. This is a huge improvement for my career compared to what I had in Canada.
Now let’s compare the salary:
In Canada, I had a decent salary for a university professor.
In China, my base salary is already about 15% higher than what I received in Canada. This is partly due to the fact that I work at a top university located in a rich city (Shenzhen), and that I also received a major pay increase after receiving the young talent funding. However, besides the salary, it is possible to receive many bonuses in China that can increase your income through various programs. Just to give you an example, in the city of Shenzhen, there is a program called the Peacock program that can provide more than 2,000,000 RMB (about $400,000 CAD) for living expenses, over five years, to excellent researchers working in that city. I will not say how much I earn. But including these special programs, I can say that my salary is now about twice what I earned in Canada.
In terms of living expenses, living in China is of course much less expensive than living in Canada. And the income tax is more or less similar, depending on the province in Canada. In the bigger cities in China, renting an apartment can be expensive. However, everything else is cheap. Thus, the overall cost of living is much lower than in Canada.
In terms of life in general, of course, life in China is different from life in Canada, in many ways. There are always some advantages and disadvantages to living in any country, as nothing is perfect anywhere. But I really enjoy my life in China. And since I greatly enjoy Chinese culture (and speak some Chinese), this is great for me. The city where I work is a very modern city that is very safe (I would never be worried about walking late at night). In terms of the work environment, I am also very satisfied. I have great colleagues, and everyone is friendly. Overall, it is very exciting to work there, and I expect that it will greatly improve my research in the next few years.
Also, it is quite inspiring to work at and contribute to a top university and a city that are currently expanding very quickly. To give you an idea, the population of the city has almost doubled in the last fifteen years, reaching more than 10 million people, and 18 million when including the surrounding areas. A few new subway lines have also opened in the last few years. There are also many possibilities for projects with industry and government in such a large city.
Conclusion
In this blog post, I wanted to discuss the reasons why I decided to move to China, and why I consider it one of the best decisions that I ever took for my career, as I think it may be interesting for other researchers.
By the way, if you are a data mining researcher and are looking for a faculty position in China, you may leave me a message. My research center is looking to hire at least one professor with a data mining background.
This week, I have been attending the DEXA 2016 and DAWAK 2016 conferences in Porto, Portugal, from the 4th to the 8th of September 2016, to present three papers. In this blog post, I will give a brief report about these conferences.
About these conferences
The DEXA conference is a well-established international conference related to databases and expert systems. This year was the 27th edition of the conference. It is held in Europe every year, but still attracts a considerable number of researchers from other continents.
DEXA is a kind of multi-conference: it actually consists of six smaller conferences that are organized together. Below, I provide a description of each of these sub-conferences and indicate their acceptance rates.
DEXA 2016 (27th Intern. Conf. on Database and Expert Systems Applications).
Acceptance rate: 39 / 137 = 28% for full papers, and another 29 / 137 = 21 % for short papers
DaWaK 2016 (18th Intern. Conf. on Big Data Analytics and Knowledge Discovery)
Acceptance rate: 25 / 73= 34%
EGOVIS 2016 (5th Intern. Conf. on Electronic Government and the Information Systems Perspective)
Acceptance rate: not disclosed in the proceedings, 22 papers published
ITBAM 2016 (7th Intern. Conf. on Information Technology in Bio- and Medical Informatics)
Acceptance rate: 9 / 26 = 36 % for full papers, and another 11 / 26 = 42% for short papers
TrustBus 2016 (13th Intern. Conf. on Trust, Privacy, and Security in Digital Business)
Acceptance rate: 25 / 73= 43%
EC-Web 2016 (17th Intern. Conf. on Electronic Commerce and Web Technologies)
Thus, the DEXA conference is more popular than DaWaK and the other sub-conferences, and is also more competitive in terms of acceptance rate.
Proceedings
The proceedings of each of the first five sub-conferences are published by Springer in the Lecture Notes in Computer Science series, which is quite nice, as it ensures that the papers are indexed by major indexes in computer science such as DBLP. The proceedings were provided to attendees on a USB drive.
badge and proceedings of DEXA
The conference location
The conference was locally organized by the Instituto Superior de Engenharia do Porto (ISEP) in Porto, Portugal. The location was great, as Porto is a beautiful European city with a long history. The old town of Porto is especially beautiful. Moreover, visiting Porto is quite inexpensive.
First day of the conference
The first day of the conference started at 10:30 AM and consisted mostly of paper presentations. The main topics of the DEXA papers on the first day were temporal databases, high-utility itemset mining, periodic pattern mining, privacy-preserving data mining, and clustering. In particular, I gave two paper presentations related to itemset mining:
a paper about discovering high utility itemsets with multiple thresholds.
Besides, there was a keynote entitled “From Natural Language to Automated Reasoning” by Bruno Buchberger from Austria, a famous researcher in the field of symbolic computation. The keynote was about using formal automated reasoners (e.g., math theorem provers) based on logic to analyze texts. For example, the speaker proposed extracting formal logic formulas from tweets, and then understanding their meaning using automated reasoners and a knowledge base provided by the user. This was a quite unusual perspective on tweet analysis, since nowadays researchers in natural language processing prefer statistical approaches to analyzing texts rather than approaches relying on logic and a knowledge base. This gave rise to some discussion during the question period after the keynote.
DEXA 2016 first keynote speech
During the evening, there was also a reception in a garden inside the institute where the conference was held.
Second day of the conference
On the second day, I attended DAWAK. In the morning, there were several paper presentations. I presented a paper about recent high-utility itemset mining. The idea is to discover itemsets (sets of items) that have recently been profitable in customer transactions, and then use this insight for marketing decisions. There was also an interesting paper presentation about big data itemset mining by the student Martin Kirchgessner from France.
Then, there was an interesting keynote about the price of data by Gottfried Vossen from Germany. The talk started by discussing the fact that companies are collecting more and more rich data about people. Usually, many people give personal data away for free to use services such as Facebook or Gmail. There also exist several marketplaces where companies can buy data, such as the Microsoft Azure Marketplace, as well as different pricing models for data. For example, one could have different pricing models to sell more or less detailed views of the same data. There also exist repositories of public data. Another issue is what happens to someone’s data when they die. In the future, a new way of buying products could be to pay for data about the design of an object, and then print it ourselves using 3D printers or other tools. Other issues related to the sale of data are DRM, selling second-hand data, etc. Overall, it was not a technical presentation, but it discussed an important topic nowadays, which is the value of data in our society that relies more and more on technology.
Third day of the conference
On the third day of the conference, there were more paper presentations and also a keynote that I missed. In the evening, there was a nice banquet at a wine cellar named Taylor. We had the pleasure of visiting the cellar, enjoying a nice dinner, and listening to a Portuguese band at the end of the evening.
Conclusion
Overall, this was a very interesting conference. I had the opportunity to talk with some excellent researchers, especially from Europe, including some that I had met at other conferences. There were also some papers quite related to my sub-field of research in data mining. DEXA may not be a first-tier conference, but it is a very decent conference, and I would submit more papers to it in the future.
This week, I attended the IEA AIE 2016 conference, held in Morioka, Japan, from the 2nd to the 4th of August 2016. In this blog post, I will briefly discuss the conference.
About the conference
IEA AIE 2016 (29th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems) is an artificial intelligence conference with a focus on applications of artificial intelligence. But this conference also accepts theoretical papers, as well as data science papers.
First day of the conference
Two keynote speeches were given on the first day of the conference. At the second keynote, I saw about 50 people attending. Many attendees came from Asia, which is understandable since the conference was held in Japan.
The paper presentations on the first day were about topics such as Knowledge-based Systems, Semantic Web, Social Network (clustering, relationship prediction), Neural Networks, Evolutionary Algorithms and Heuristic Search, Computer Vision and Adaptive Control.
Second day
On the second day of the conference, there was a great keynote talk by Prof. Jie Lu from Australia about recommender systems. She first introduced the main recommendation approaches (content-based filtering, collaborative filtering, knowledge-based recommendation, and hybrid approaches) and some of the typical problems that recommender systems face (the cold-start problem, the sparsity problem, etc.). She then talked about her recent work on extensions of the recommendation problem, such as group recommendation (e.g., recommending a restaurant that will globally satisfy, or at least not disappoint, a group of people), trust-based recommendation (e.g., a system that recommends products to you based on what friends that you trust, or friends of your friends, have liked), fuzzy recommender systems (recommender systems that consider that each item can belong to more than one category), and cross-domain recommendation (e.g., if you like reading books about Kung Fu, you may also like watching movies about Kung Fu).
After that, there were several paper presentations. The main topics were Semantic Web, Social Networks, Data Science, Neural Networks, Evolutionary Algorithms, Heuristic Search, Soft Computing, and Multi-Agent Systems.
In the evening, there was a great banquet at the Morioka Grand Hotel, a nice hotel located on a mountain, on the outskirts of the city.
Moreover, during the dinner, there was a live music band:
Third day
On the third day, there was an interesting keynote on robotics by Prof. Hiroshi Okuno from Waseda University / the University of Tokyo. His team has developed an open-source software called HARK for robot audition. Robot audition refers to the general process by which a robot can process sounds from its environment. The software, which is the result of years of research, has been used in robots equipped with arrays of microphones. By using the HARK library, robots can listen to multiple persons talking to the robot at the same time, localize where the sounds come from, and isolate sounds, among many other capabilities.
This was followed by paper presentations on topics such as Data Science (KNN, SVM, itemset mining, clustering), Decision Support Systems, Medical Diagnosis and Bio-informatics, and Natural Language Processing and Sentiment Analysis. I presented a paper about a new high-utility itemset mining algorithm called FHM+.
Location
The conference was located in Morioka, a relatively small city in Japan. However, the timing of the conference was perfect: it was held during the Sansa Odori festival, one of the most famous festivals in Japan. Thus, during the evenings, it was possible to watch the Sansa parade, where people wearing traditional costumes played Taiko drums and danced in the streets.
Conclusion
The conference was quite interesting. Since it is a rather general conference, I did not talk with many people close to my research area. But I met some interesting people, including some top researchers. During the conference, it was announced that IEA AIE 2017 will be held in Arras (close to Paris, France).
In this blog post, I will discuss how to give a good paper presentation at an academic conference. If you are a researcher, this is an important topic, because giving a good presentation of your work will raise interest in it. In fact, a good researcher should not only be good at doing research, but also at communicating the results of that research, both in writing and orally.
Rule 1 : Prepare yourself, and know the requirements
Giving a good oral presentation starts with a good preparation. One should not prepare his presentation the day before the presentation but a few days before, to make sure that there will be enough time to prepare well. A common mistake is to prepare a presentation the evening before the presentation. In that case, the person may finish preparing the presentation late, not sleep well, be tired, and as a result give a poor presentation.
Preparing a presentation does not only mean designing some PowerPoint slides. It also means practicing your presentation. If you are going to present a paper at a conference, you should ideally practice giving your talk several times in your bedroom or in front of friends before facing your audience. Then, you will be better prepared, feel less nervous, and give a better presentation.
It is also important to understand the context of your presentation: (1) who will attend it? (2) how long should it be? (3) what kind of equipment will be available (projector, computer, room, etc.)? (4) what is the background of the people attending? These are general questions that need to be answered to help you prepare an oral presentation.
Who will attend the presentation is important. A presentation given in front of experts from your field should be different from one given to your friends, your research advisor, or some kids. A presentation should always be adapted to the audience.
To avoid unpleasant surprises, it is always better to check the equipment that will be available for your presentation and to prepare a backup plan in case problems occur. For example, one may bring a laptop and a copy of the presentation on a USB drive, as well as a copy in an e-mail inbox, just in case.
It is also important to know the expected length of the presentation. If a conference presentation should last no more than 20 minutes, for example, then one should make sure that it does not exceed 20 minutes. At an academic conference, it is quite possible that someone will stop your presentation if you exceed the time limit. Moreover, exceeding the time limit may be seen as disrespectful.
Rule 2 : Always look at your audience
When giving a presentation, there are a few important rules that should always be followed. One of the most important ones is to always look at your audience when talking.
One should NEVER read the slides or turn one's back to the audience for more than a few seconds. I have seen presenters at academic conferences who did not look at the audience for long periods of time, and it is one of the best ways to annoy the audience. For example, here are some pictures that I took at an academic conference.
Not looking at the audience
Turning your back to audience
In that presentation, the presenter barely looked at the audience. He was either looking at the floor when talking (first picture) or reading the slides (second picture). This is one of the worst things to do, and the presentation was in fact awful. Not because the research work was not good, but because it was poorly presented. To give a good presentation, one should try to look at the audience as much as possible. It is OK to sometimes look at a slide to point something out, but not for more than a few seconds. Otherwise, the audience may lose interest in your presentation.
Personally, when I give a presentation, I look very quickly at the computer screen to see some keywords and remember what I should say, and then I look at the audience to talk. When I move to the next slide, I briefly look at the screen again to remember what I should say on that slide, and then I continue looking at the audience while talking. Doing this results in a much better presentation, but it may require some preparation. If you practice giving your talk several times in your bedroom, for example, you will become more natural and you will not need to read your slides when it is time to present in front of an audience.
Rule 3: Talk loud enough
Another important thing is to talk LOUD enough when giving a presentation, and to speak clearly. Make sure that even the people at the back of the room can hear you clearly. This seems obvious, but several times at academic conferences, I have seen presenters who did not speak loud enough, which becomes very boring for the audience, especially for those at the back of the room.
Rule 4: Do not sit
Another good piece of advice is to stand when giving a presentation. I have sometimes seen people give a presentation while seated. In general, if you are seated, you will be less “dynamic”. It is always better to stand up to give a presentation.
For example, here is someone breaking two rules by sitting and turning her back to the audience at an international conference:
Rule 5: Make simple slides
A very common problem that I have observed in presentations at academic conferences is that presenters put way too much content on their slides. Here are some pictures that I took at an academic conference as examples:
A slide with too many formulas
In this picture, the problem is that there are too many technical details and formulas. It is impossible for someone attending a 20-minute presentation with slides full of formulas to read, understand, and remember all these formulas and symbols.
In general, when I give a presentation at a conference, I do not show all the details, formulas, or theorems. Instead, I give only the minimum details needed for the audience to understand the basic idea of my work: what problem I want to solve and the main intuition behind the solution. I also try to explain some applications of my work and show some illustrations or simple examples to make it easy to understand. Actually, the goal of a paper presentation is that the audience understands the main idea of your work. Then, anyone who wants to know all the technical details can read your paper.
If someone does as in the picture above, giving way too many technical details or showing a lot of formulas during a presentation, then the audience will very quickly get lost in the details and stop following the presentation.
Here is another example:
A slide with way too much content
In the above slide, there is way too much text. Nobody in the audience will read all of it. To give a good presentation, you should make your slides as simple as possible. You should also not put full sentences, but rather just keywords or short sentence fragments. The reason is that during the presentation you should not read the slides, and the audience should not be reading long text on your slides either. You should talk, and the audience should listen to you rather than read your slides. Here is an example of a good slide design:
A powerpoint slide with just enough content
This slide has just enough content. It has some very short text that gives only the main points. The presenter can then talk to explain these points in more detail while looking at the audience rather than reading the slides.
Conclusion
A lot more could be said about giving a good presentation, but I did not want to write too much for today. Actually, giving good presentations is something that is learned through practice. The more you practice giving presentations, the more comfortable you will become talking in front of many people.
Also, it is quite normal for a student to be nervous when giving a presentation, especially in a foreign language. In that case, more preparation is required.
Hope that you have enjoyed this short blog post.
==
In this blog post, I will provide a brief report about the 12th International Conference on Machine Learning and Data Mining (MLDM 2016), which I attended from July 18th to 20th, 2016 in Newark, USA.
First, I have to say that I was not very happy that the conference was held in Newark, because it had been advertised as being in New York until after our paper got accepted. As you can see on the map below, Newark is in a different state, about one hour away from New York by train, and there was nothing around the conference location: it was a small hotel surrounded by highways.
About the conference
This is the 12th edition of the conference. The MLDM conference is co-located and co-organized with the 16th Industrial Conference on Data Mining 2016, which I also attended this week. The proceedings of MLDM are published by Springer. Moreover, an extra book containing two late papers, published by Ibai solutions, was offered. On the proceedings book, it is again claimed that the conference was in New York, while it was not.
The MLDM 2016 proceedings
Acceptance rate
The acceptance rate of the conference is about 34% (58 papers were accepted out of 169 submissions), which is reasonable.
First day of the conference
The first day of the MLDM conference started at 9:00 with an opening ceremony, followed by a keynote on supervised clustering. The idea of supervised clustering is to perform clustering on data that already has some class labels. Thus, it can be used, for example, to discover sub-classes in existing classes. The class labels can also be used to evaluate how good some clusters are. One of the cluster evaluation measures suggested by the keynote speaker is purity, that is, the percentage of instances having the most popular class label in a cluster. Among other applications, the purity measure can be used to remove outliers from clusters.
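To make the purity measure concrete, here is a minimal sketch (my own illustration, not code from the keynote) that computes the purity of a single cluster from the class labels of its instances:

```python
from collections import Counter

def purity(cluster_labels):
    """Purity of one cluster: the fraction of instances carrying
    the most frequent class label in that cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

# A cluster where 4 of 5 instances share the label "A"
print(purity(["A", "A", "A", "A", "B"]))  # 0.8
```

A cluster with purity close to 1 is dominated by one class, while a low purity suggests mixed classes or potential outliers.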
After the keynote, there were paper presentations for the rest of the day. Topics were quite varied, including clustering, support vector machines, stock market prediction, list price optimization, image processing, automatic authorship attribution of texts, driving style identification, and source code mining.
The conference room
The conference ended at around 17:00 and was followed by a banquet at 18:00. There were about 40 people attending the conference in the morning. Overall, there were some interesting paper presentations and discussions.
Second day of the conference
The second day was also a day of paper presentations.
Second day of the conference (afternoon)
The topics of the second day included itemset mining algorithms, inferring geo-information about persons, multigroup regression, analyzing the content of videos, time-series classification, gesture recognition (a presentation by Intel) and analyzing the evolution of communities in social networks.
I presented two papers during that day (one of mine and one on behalf of a colleague), including a paper about high-utility itemset mining.
Third day of the conference
The third day of the conference also consisted of paper presentations. There were various topics such as image classification, image enhancement, mining patterns in cellular radio access network data, random forest learning, clustering, and graph mining.
Conclusion
Overall, it was an interesting conference, but I was not very happy about the location, and the conference is rather expensive at around 650 euros.
I attended both the Industrial Conference on Data Mining and the MLDM conference this week. MLDM is more focused on theory, while the Industrial Conference on Data Mining is more focused on industrial applications. MLDM is a slightly bigger conference.
Would I attend this conference again? No. I did not appreciate that it was not held in New York. Some other attendees were also unhappy about this.
==
In this blog post, I will provide a brief report about the 16th Industrial Conference on Data Mining 2016, which I attended from July 13th to 14th, 2016 near New York, USA.
About the conference
This conference is an established conference in the field of data mining. It is the 16th edition of the conference. The main proceedings are published by Springer, which ensures that they are well indexed, while the poster proceedings are published by Ibai solutions.
The proceedings for full papers and posters
The acronym of this conference is ICDM (Industrial Conference on Data Mining). However, it should not be confused with IEEE ICDM (IEEE International Conference on Data Mining).
Acceptance rate
The quality of the papers at this conference was fine. 33 papers were accepted, and it was said during the conference that the acceptance rate was 33%. Interestingly, this conference attracts many papers related to industry applications due to its name. There were some paper presentations from IBM Research and Mitsubishi.
Conference location
In the past, this conference has been held mostly in Germany, as the organizer is German. But this year, it was organized in the New York area, USA, more specifically at a hotel in Newark, New Jersey, beside New York. Interestingly, the conference is co-located and co-organized with the MLDM conference. Hence, it is possible to spend 9 days in Newark to attend both conferences. You can read my report about MLDM 2016 here: MLDM 2016.
First day of the conference
The conference lasts two days (excluding workshops and tutorials, which require additional fees). The first day was scheduled to start at 9:00 and finish at around 18:00.
The conference room
The conference started with an opening ceremony. After the opening ceremony, there was supposed to be a keynote, but the keynote speaker unfortunately could not attend the conference due to personal reasons.
Then, there were paper presentations. The main topics of the papers were applications in medicine, algorithms for itemset mining, applying clustering to analyze data about divers, text mining, prediction of rainfall using deep learning, analyzing GPS traces of truck drivers, and preventing network attacks in the cloud using classifiers. I presented a paper about high-utility itemset mining and periodic pattern mining.
After the paper presentations, there was a poster session. One of the posters applied data mining in archaeology, which is a quite unusual application of data mining.
The poster session
Finally, there was a banquet at the same hotel. The food was fine. Since the conference is quite small, there were only three tables of 10 people. But a small conference also means a more welcoming atmosphere and more proximity between participants, so it was easy to talk with everyone.
Second day of the conference
The second day of the conference started at 9:00. It consisted of paper presentations covering topics such as association rules, alarm monitoring, distributed data mining, process mining, diabetes age prediction, and data mining to analyze wine ratings.
The second day
Best paper award
One thing that I think is great about this conference is that there is a 500 € prize for the best paper. Not many conferences have a cash prize for the best paper award. Also, there were three papers nominated for the best paper award, and each nominee got a certificate. As one of my papers was nominated, I got a nice certificate (though I did not receive the best paper award).
Best paper nomination certificate
Conclusion
It was the first time that I attended this conference. It is a small conference compared to some others, and it is not a first-tier conference in data mining. But its proceedings are still published by Springer. At this conference, I did not meet anyone working exactly in my field, but I still had the opportunity to meet some interesting researchers. The MLDM conference, which I attended right after, was bigger.
==
A key question for data mining and data science researchers is to know which are the top journals and conferences in the field, since it is always best to publish in the most popular venues. In this blog post, I will look at four different rankings of data mining journals and conferences, based on different criteria, and discuss them.
1) The Google Ranking of data mining and analysis journals and conferences
A first ranking is the Google Scholar ranking (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis). This ranking is automatically generated based on the H5-index. The H5-index is described as “the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2011-2015 have at least h citations each”. The ranking of the top 20 conferences and journals is the following:
Rank | Publication | h5-index | h5-median | Type
1 | ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | 67 | 98 | Conference
2 | IEEE Transactions on Knowledge and Data Engineering | 66 | 111 | Journal
3 | ACM International Conference on Web Search and Data Mining | 58 | 94 | Conference
4 | IEEE International Conference on Data Mining (ICDM) | 39 | 64 | Conference
5 | Knowledge and Information Systems (KAIS) | 38 | 52 | Journal
6 | ACM Transactions on Intelligent Systems and Technology (TIST) | 37 | 68 | Journal
7 | ACM Conference on Recommender Systems | 35 | 64 | Conference
8 | SIAM International Conference on Data Mining (SDM) | 35 | 54 | Conference
9 | Data Mining and Knowledge Discovery | 33 | 57 | Journal
10 | Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery | 30 | 56 | Journal
11 | European Conference on Machine Learning and Knowledge Discovery in Databases (PKDD) | 30 | 36 | Conference
12 | Social Network Analysis and Mining | 26 | 37 | Journal
13 | ACM Transactions on Knowledge Discovery from Data (TKDD) | 23 | 39 | Journal
14 | International Conference on Artificial Intelligence and Statistics | 23 | 29 | Conference
15 | Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) | 22 | 30 | Conference
16 | IEEE International Conference on Big Data | 18 | 30 | Conference
17 | Advances in Data Analysis and Classification | 18 | 25 | Journal
18 | Statistical Analysis and Data Mining | 17 | 30 | Journal
19 | BioData Mining | 17 | 25 | Journal
20 | Intelligent Data Analysis | 16 | 21 | Journal
Some interesting observations can be made from this ranking:
It shows that some conferences in the field of data mining actually have a higher impact than some journals. For example, the well-known KDD conference is ranked higher than all journals.
It appears strange that the KAIS journal is ranked higher than DMKD/DAMI and the TKDD journals, which are often regarded as better journals than KAIS. However, it may be that the field is evolving, and that KAIS has really improved over the years.
2) The Microsoft Ranking of data mining conferences
Another automatically generated ranking is the Microsoft ranking of data mining conferences (http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&orderby=1). This ranking is based on the number of publications and citations. Contrary to Google, Microsoft has separate rankings for conferences and journals.
Besides, Microsoft offers two metrics for ranking: the number of citations and the Field Rating. It is not very clear how the “Field Rating” is calculated by Microsoft. The Microsoft help center describes it as follows: “The Field Rating is similar to h-index in that it calculates the number of publications by an author and the distribution of citations to the publications. Field rating only calculates publications and citations within a specific field and shows the impact of the scholar or journal within that specific field”.
Here is the conference ranking of the top 30 conferences by citations:
Rank | Conference | Publications | Citations
1 | KDD – Knowledge Discovery and Data Mining | 2063 | 69270
2 | ICDE – International Conference on Data Engineering | 4012 | 67386
3 | CIKM – International Conference on Information and Knowledge Management | 2636 | 28621
4 | ICDM – IEEE International Conference on Data Mining | 2506 | 18362
5 | SDM – SIAM International Conference on Data Mining | 708 | 9095
6 | PKDD – Principles of Data Mining and Knowledge Discovery | 994 | 8875
7 | PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining | 1255 | 6400
8 | DASFAA – Database Systems for Advanced Applications | 1251 | 4001
9 | RIAO – Recherche d’Information Assistee par Ordinateur | 574 | 3551
10 | DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery | 103 | 3264
11 | DaWaK – Data Warehousing and Knowledge Discovery | 503 | 2648
12 | DS – Discovery Science | 553 | 2256
13 | Fuzzy Systems and Knowledge Discovery | 4626 | 2171
14 | DOLAP – International Workshop on Data Warehousing and OLAP | 177 | 1830
15 | IDEAL – Intelligent Data Engineering and Automated Learning | 1032 | 1789
16 | WSDM – Web Search and Data Mining | 196 | 1499
17 | GRC – IEEE International Conference on Granular Computing | 1351 | 1434
18 | ICWSM – International Conference on Weblogs and Social Media | 238 | 1142
19 | DMDW – Design and Management of Data Warehouses | 70 | 993
20 | FIMI – Workshop on Frequent Itemset Mining Implementations | 32 | 849
21 | MLDM – Machine Learning and Data Mining in Pattern Recognition | 313 | 822
22 | PJW – Workshop on Persistence and Java | 41 | 712
23 | ADMA – Advanced Data Mining and Applications | 562 | 676
24 | ICETET – International Conference on Emerging Trends in Engineering & Technology | 712 | 376
25 | WKDD – Workshop on Knowledge Discovery and Data Mining | 527 | 342
26 | KDID – International Workshop on Knowledge Discovery in Inductive Databases | 70 | 328
27 | ICDM – Industrial Conference on Data Mining | 304 | 323
28 | DMIN – Int. Conf. on Data Mining | 434 | 278
29 | MineNet – Mining Network Data | 22 | 278
30 | WebMine – Workshop on Web Mining | 15 | 245
And here are the top 30 conferences by Field Rating:
Rank | Conference | Publications | Field Rating
1 | KDD – Knowledge Discovery and Data Mining | 2063 | 122
2 | ICDE – International Conference on Data Engineering | 4012 | 104
3 | CIKM – International Conference on Information and Knowledge Management | 2636 | 67
4 | ICDM – IEEE International Conference on Data Mining | 2506 | 56
5 | SDM – SIAM International Conference on Data Mining | 708 | 45
6 | PKDD – Principles of Data Mining and Knowledge Discovery | 994 | 40
7 | PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining | 1255 | 33
8 | RIAO – Recherche d’Information Assistee par Ordinateur | 574 | 28
9 | DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery | 103 | 27
10 | DASFAA – Database Systems for Advanced Applications | 1251 | 26
11 | DaWaK – Data Warehousing and Knowledge Discovery | 503 | 22
12 | DOLAP – International Workshop on Data Warehousing and OLAP | 177 | 22
13 | DS – Discovery Science | 553 | 20
14 | ICWSM – International Conference on Weblogs and Social Media | 238 | 19
15 | WSDM – Web Search and Data Mining | 196 | 19
16 | DMDW – Design and Management of Data Warehouses | 70 | 19
17 | PJW – Workshop on Persistence and Java | 41 | 16
18 | FIMI – Workshop on Frequent Itemset Mining Implementations | 32 | 14
19 | GRC – IEEE International Conference on Granular Computing | 1351 | 13
20 | IDEAL – Intelligent Data Engineering and Automated Learning | 1032 | 13
21 | MLDM – Machine Learning and Data Mining in Pattern Recognition | 313 | 13
22 | Fuzzy Systems and Knowledge Discovery | 4626 | 11
23 | ADMA – Advanced Data Mining and Applications | 562 | 10
24 | KDID – International Workshop on Knowledge Discovery in Inductive Databases | 70 | 10
25 | ICDM – Industrial Conference on Data Mining | 304 | 9
26 | MineNet – Mining Network Data | 22 | 9
27 | ESF Exploratory Workshops | 17 | 8
28 | TSDM – Temporal, Spatial, and Spatio-Temporal Data Mining | 13 | 8
29 | ICETET – International Conference on Emerging Trends in Engineering & Technology | 712 | 7
30 | WKDD – Workshop on Knowledge Discovery and Data Mining | 527 | 7
Some observations:
The ranking by citations and by field rating are quite similar.
The KDD conference is still the #1 conference, which makes sense, and the CIKM, ICDM and SDM conferences are also among the top conferences in the field.
PKDD is higher than PAKDD, which is higher than DASFAA and DaWaK, as in the Google ranking, and I agree with this.
Some conferences, like ICDE, were not in the Google ranking. It may be because Google puts the ICDE conference in a different category.
Microsoft ranks DMKD / DAMI as a conference, while it is a journal.
The FIMI workshop is also ranked high, although that workshop was only held in 2003 and 2004. Thus, it seems that Microsoft imposes no restriction on time. Actually, since the FIMI workshop has not been held since 2004, it should not be in this ranking. The ranking would probably be better if Microsoft considered only the last five years, for example.
3) The Microsoft Ranking of data mining journals
Now let’s look at the top 20 data mining journals according to Microsoft, by citations.
Rank | Journal | Publications | Citations
1 | IPL – Information Processing Letters | 7044 | 62746
2 | TKDE – IEEE Transactions on Knowledge and Data Engineering | 2742 | 60945
3 | CS&DA – Computational Statistics & Data Analysis | 4524 | 24716
4 | DATAMINE – Data Mining and Knowledge Discovery | 584 | 19727
5 | VLDB – The VLDB Journal | 631 | 17785
6 | Journal of Knowledge Management | 747 | 9601
7 | Sigkdd Explorations | 491 | 9564
8 | Journal of Classification | 550 | 8041
9 | KAIS – Knowledge and Information Systems | 741 | 7639
10 | WWW – World Wide Web | 540 | 7182
11 | INFFUS – Information Fusion | 567 | 5617
12 | IDA – Intelligent Data Analysis | 477 | 4167
13 | Transactions on Rough Sets | 221 | 1653
14 | JECR – Journal of Electronic Commerce Research | 122 | 1577
15 | TKDD – ACM Transactions on Knowledge Discovery From Data | 110 | 716
16 | IJDWM – International Journal of Data Warehousing and Mining | 102 | 366
17 | IJDMB – International Journal of Data Mining and Bioinformatics | 132 | 256
18 | IJBIDM – International Journal of Business Intelligence and Data Mining | 124 | 251
19 | Statistical Analysis and Data Mining | 124 | 169
20 | IJICT – International Journal of Information and Communication Technology | 111 | 125
And here are the top 20 journals by Field Rating.
Rank | Journal | Publications | Field Rating
1 | TKDE – IEEE Transactions on Knowledge and Data Engineering | 2742 | 109
2 | IPL – Information Processing Letters | 7044 | 80
3 | VLDB – The VLDB Journal | 631 | 61
4 | DATAMINE – Data Mining and Knowledge Discovery | 584 | 57
5 | Sigkdd Explorations | 491 | 50
6 | CS&DA – Computational Statistics & Data Analysis | 4524 | 49
7 | Journal of Knowledge Management | 747 | 46
8 | WWW – World Wide Web | 540 | 42
9 | Journal of Classification | 550 | 37
10 | INFFUS – Information Fusion | 567 | 36
11 | KAIS – Knowledge and Information Systems | 741 | 33
12 | IDA – Intelligent Data Analysis | 477 | 28
13 | JECR – Journal of Electronic Commerce Research | 122 | 21
14 | Transactions on Rough Sets | 221 | 20
15 | TKDD – ACM Transactions on Knowledge Discovery From Data | 110 | 13
16 | IJDMB – International Journal of Data Mining and Bioinformatics | 132 | 8
17 | IJDWM – International Journal of Data Warehousing and Mining | 102 | 8
18 | IJBIDM – International Journal of Business Intelligence and Data Mining | 124 | 7
19 | Statistical Analysis and Data Mining | 124 | 7
20 | IJICT – International Journal of Information and Communication Technology | 111 | 5
Some observations:
The ranking by citations and field rating are quite similar.
The TKDE journal is again in the top of the ranking, just like in the Google ranking.
It makes sense that the VLDB Journal is quite high. This journal was not in the Google ranking, probably because it is more a database journal than a data mining journal.
SIGKDD Explorations is also a good journal, and it makes sense for it to be in the list. However, I am not sure that it should be ranked higher than TKDD and DMKD / DAMI.
The KAIS journal is still ranked quite high. This time it is lower than DMKD / DAMI (unlike in the Google ranking) but still higher than TKDD, which is quite strange: TKDD is arguably a better journal. As explained in the comment section of this blog post, a reason why KAIS is ranked so high may be that, in the past, the journal encouraged authors to cite papers from KAIS. Besides, it appears that the Microsoft ranking has no restriction on time (it does not consider only the last five years, for example).
It is also quite strange that “Intelligent Data Analysis” is ranked higher than TKDD.
Some journals like WWW and JECR should perhaps not be in this ranking. Although they publish data mining papers, they do not focus exclusively on data mining, and this is probably why they are not in the Google ranking. Overall, the Microsoft ranking seems broader than the Google ranking.
4) Impact factor ranking
Now, another popular way of ranking journals is by their impact factor (IF). I have taken some of the top data mining journals above and obtained their impact factor from 2014/2015, or from 2013 when I could not find more recent information. Here is the result:
Journal | Impact factor
DMKD/DAMI – Data Mining and Knowledge Discovery | 2.714
IEEE Transactions on Knowledge and Data Engineering | 2.476
Knowledge and Information Systems | 1.78
VLDB – The VLDB Journal | 1.57
TKDD – ACM Transactions on Knowledge Discovery From Data | 1.14
Advances in Data Analysis and Classification | 1.03
Intelligent Data Analysis | 0.50
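As a reminder, the classic two-year impact factor of a journal for a given year is the number of citations received that year to articles published in the two preceding years, divided by the number of citable items published in those two years. Here is a trivial sketch (the numbers are made up for illustration):

```python
def impact_factor(citations_to_prev_two_years, items_prev_two_years):
    """Classic two-year impact factor: citations received this year
    to articles of the previous two years, divided by the number of
    citable items published in those two years."""
    return citations_to_prev_two_years / items_prev_two_years

# Hypothetical journal: 500 citations in 2015 to its 2013-2014 articles,
# which numbered 184 citable items
print(round(impact_factor(500, 184), 3))  # 2.717
```

This also shows why the IF favors journals whose papers are cited quickly: citations arriving more than two years after publication do not count.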
Some observations:
TKDE and DAMI/DMKD are still among the top journals.
As in the Microsoft ranking, DAMI/DMKD is above KAIS, which is above TKDD.
As pointed out in the comment section of this blog post, it is strange that KAIS is so high, compared for example to TKDD, or VLDB, which is a first-tier database journal. This shows that IF is not a perfect metric.
Compared to the Microsoft ranking, the IF ranking at least places the Intelligent Data Analysis journal much lower than TKDD. This makes sense, as TKDD is a better journal.
Conclusion
In this blog post, we have looked at four different rankings of data mining journals and conferences: the Google ranking, the Microsoft rankings of conferences and journals, and the Impact Factor ranking.
None of these rankings is perfect. They are somewhat accurate, but they may not always correspond to actual reputations in the data mining field. The Google ranking is more focused on the data mining field, while the Microsoft ranking is perhaps too broad and seems to have no restriction on time. Also, as can be seen by observing these rankings, different measures yield different rankings. However, there are still some clear trends, such as TKDE being ranked as one of the top journals and KDD as the top conference in all rankings. The top journals and conferences are more or less the same in each ranking. But there are also some strange results, such as KAIS and Intelligent Data Analysis being ranked higher than TKDD in the Microsoft ranking.
Do you agree with these rankings? Please leave your comments below!
Update 2016-07-19: I have updated the blog post based on the insightful comments made by Jefrey in the comment section. Thanks!
==
In this blog post, I will give an introduction to the discovery of periodic patterns in data. Mining periodic patterns is an important data mining task, as patterns may appear periodically in all kinds of data, and it may be desirable to find them to understand the data and make strategic decisions.
For example, consider customer transactions made in retail stores. Analyzing the behavior of customers may reveal that some customers have periodic behaviors, such as buying certain products, like wine and cheese, every weekend. Discovering these patterns may be useful to promote products on weekends, or to make other marketing decisions.
Another application of periodic pattern mining is stock market analysis. One may analyze the price fluctuations of stocks to uncover periodic changes in the market. For example, the stock price of a company may follow some patterns every month before it pays its dividend to its shareholders, or before the end of the year.
In the following paragraphs, I will first present the problem of discovering frequent periodic patterns, that is, periodic patterns that appear frequently. Then, I will discuss an extension of the problem called periodic high-utility pattern mining, which aims at discovering profitable patterns that are periodic. Moreover, I will present open-source implementations of popular algorithms such as PFPM and PHM for mining periodic patterns.
The problem of mining frequent periodic patterns
The problem of discovering periodic patterns can be generally defined as follows. Consider a database of transactions depicted below. Each transaction is a set of items (symbols), and transactions are ordered by their time.
A transaction database
Here, the database contains seven transactions labeled T1 to T7. This database format can be used to represent all kinds of data. However, for our example, assume that it is a database of customer transactions in a retail store. The first transaction indicates that a customer has bought the items a and c together. For example, a could mean an apple, and c could mean cereals.
Having such a database of objects or transactions, it is possible to find periodic patterns. The concept of periodic patterns is based on the notion of period.
A period is the time elapsed between two occurrences of a pattern. It can be counted in terms of time or in terms of a number of transactions. In the following, we will count the length of periods as a number of transactions. For example, consider the itemset (set of items) {a,c}. This itemset has five periods, illustrated below. The number annotating each period is the period length, calculated as a number of transactions.
The first period of {a,c} is the gap before the first occurrence of {a,c}. Here, since {a,c} appears in the first transaction of the database, this period has a length of 1.
The second period of {a,c} is the gap between the first and second occurrences of {a,c}. The first occurrence is in transaction T1 and the second occurrence is in transaction T3. Thus, the length of this period is said to be 3 – 1 = 2 transactions.
The third period of {a,c} is the gap between the second and third occurrences of {a,c}. The second occurrence is in transaction T3 and the third occurrence is in transaction T5. Thus, the length of this period is said to be 5 – 3 = 2 transactions.
The fourth period of {a,c} is the gap between the third and fourth occurrences of {a,c}. The third occurrence is in transaction T5 and the fourth occurrence is in transaction T6. Thus, the length of this period is said to be 6 – 5 = 1 transaction.
Now, the fifth period is interesting. It is defined as the time elapsed between the last occurrence of {a,c} (in T6) and the last transaction in the database, which is T7. Thus, the length of this period is also 7 – 6 = 1 transaction.
Thus, in this example, the list of period lengths of the pattern {a,c} is: 1,2,2,1,1.
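The calculation above can be sketched in a few lines of Java. The helper below is written for this post (it is not code from the SPMF library): given the 1-based positions of the transactions containing a pattern and the total number of transactions, it returns the list of period lengths.

```java
import java.util.ArrayList;
import java.util.List;

public class Periods {

    // Period lengths of a pattern: the gap before its first occurrence,
    // the gaps between consecutive occurrences, and the gap between the
    // last occurrence and the end of the database.
    static List<Integer> periodLengths(int[] occurrences, int dbSize) {
        List<Integer> periods = new ArrayList<>();
        int previous = 0; // position "before" the first transaction
        for (int pos : occurrences) {
            periods.add(pos - previous);
            previous = pos;
        }
        periods.add(dbSize - previous); // gap to the end of the database
        return periods;
    }

    public static void main(String[] args) {
        // {a,c} occurs in T1, T3, T5 and T6 of the 7-transaction example
        System.out.println(periodLengths(new int[]{1, 3, 5, 6}, 7));
    }
}
```

Running this prints [1, 2, 2, 1, 1], the same list of period lengths computed by hand above.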
Several algorithms have been designed to discover periodic patterns in such databases, such as the PFP-Tree, MKTPP, ITL-Tree, PF-tree, and MaxCPF algorithms. For these algorithms, a pattern is said to be periodic if (1) it appears in at least minsup transactions, where minsup is a number of transactions set by the user, and (2) the pattern has no period of length greater than a maxPer parameter, also set by the user.
Thus, according to this definition, if we consider minsup = 3 and maxPer = 2, the itemset {a,c} is said to be periodic because it has no period of length greater than 2 transactions and it appears in at least 3 transactions.
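Under this standard definition, the periodicity check itself is simple. Below is a hypothetical helper (written for illustration only, not taken from any of the cited algorithms) that applies the two conditions:

```java
public class StrictPeriodic {

    // Standard definition: a pattern is periodic iff its support is at
    // least minsup and none of its periods is longer than maxPer.
    static boolean isPeriodic(int[] periodLengths, int support,
                              int minsup, int maxPer) {
        if (support < minsup) {
            return false;
        }
        for (int period : periodLengths) {
            if (period > maxPer) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // {a,c}: periods 1,2,2,1,1 and support 4, with minsup = 3, maxPer = 2
        System.out.println(isPeriodic(new int[]{1, 2, 2, 1, 1}, 4, 3, 2));
    }
}
```

This prints true for {a,c}, which satisfies both conditions.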
This definition of a periodic pattern is, however, too strict. I will explain why with an example. Assume that maxPer is set to 3 transactions. Now, consider a pattern that appears every two transactions many times, but that only once reappears after 4 transactions. Because the pattern has a single period greater than maxPer, it would automatically be deemed non-periodic, although it is generally periodic. Thus, this definition is too strict.
Thus, in a 2016 paper, we proposed a solution to this issue. We introduced two new measures called the average periodicity and the minimum periodicity. The idea is that we should not discard a pattern because it has a single period that is too large, but should instead look at how periodic the pattern is on average. The resulting algorithm is called PFPM, and its search procedure is inspired by the ECLAT algorithm. PFPM discovers periodic patterns using a more flexible definition: a pattern X is said to be periodic if (1) the average length of its periods, denoted as avgper(X), is no less than a parameter minAvg and no greater than a parameter maxAvg, (2) the pattern has no period greater than a maximum maxPer, and (3) the pattern has no period smaller than a minimum minPer.
This definition of a periodic pattern is more flexible than the previous one and thus lets the user better select the periodic patterns to be found. The user can set the minimum and maximum average periodicity as the main constraints for finding periodic patterns, and use the minimum and maximum periodicity as loose constraints to filter out patterns whose periods vary too widely.
Based on the above definition, the problem of mining periodic patterns in a database is to find all periodic patterns that satisfy the constraints set by the user. For example, if minPer = 1, maxPer = 3, minAvg = 1, and maxAvg = 2, the 11 periodic patterns found in the example database are the ones shown in the table below. This table indicates the support (frequency) and the minimum, average, and maximum periodicity of each pattern found:
As can be observed in this example, the average periodicity gives a better view of how periodic a pattern is. For example, consider the patterns {a,c} and {e}. Both of these patterns have a largest period of 2 (called the maximum periodicity), and would be considered equally periodic under the standard definition of a periodic pattern. But their average periodicities are quite different: on average, {a,c} appears with a period of 1.4 transactions, while {e} appears with a period of 1.17 transactions.
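A convenient property makes the average periodicity cheap to compute: the periods of a pattern telescope from the start of the database to its end, so their lengths always sum to the database size. The average periodicity is therefore the database size divided by the number of periods, i.e. |D| / (sup(X) + 1), where sup(X) is the support of X. The sketch below uses this simplification (written for this post, not library code) and reproduces the values from the example:

```java
public class AvgPer {

    // avgper(X) = |D| / (sup(X) + 1): the period lengths of a pattern
    // always sum to the database size, and a pattern appearing in
    // sup(X) transactions has sup(X) + 1 periods.
    static double avgPeriodicity(int support, int dbSize) {
        return (double) dbSize / (support + 1);
    }

    public static void main(String[] args) {
        System.out.printf("avgper({a,c}) = %.2f%n", avgPeriodicity(4, 7));
        System.out.printf("avgper({e})   = %.2f%n", avgPeriodicity(5, 7));
    }
}
```

This prints 1.40 for {a,c} (support 4) and 1.17 for {e} (support 5), matching the table above.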
Discovering periodic patterns using the SPMF open-source data mining library
An implementation of the PFPM algorithm for discovering periodic patterns, and its source code, can be found in the SPMF data mining software. This software lets you run the PFPM algorithm from a command line or a graphical interface. Moreover, the software can be used as a library, and the source code is provided under the GPL3 license. For the example discussed in this blog post, the input database is a text file encoded as follows:
3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2
where each line is a transaction and the numbers 1, 2, 3, 4, 5 represent the letters a, b, c, d, e. To run the algorithm from the source code, the following lines of code can be used:
String input = "contextPFPM.txt";
String output = "output.txt";
int minPeriodicity = 1;         // minPer
int maxPeriodicity = 3;         // maxPer
int minAveragePeriodicity = 1;  // minAvg
int maxAveragePeriodicity = 2;  // maxAvg
// Applying the algorithm
AlgoPFPM algorithm = new AlgoPFPM();
algorithm.runAlgorithm(input, output, minPeriodicity, maxPeriodicity, minAveragePeriodicity, maxAveragePeriodicity);
The output is then a file containing all the periodic patterns shown in the table.
For example, the eighth line represents the pattern {a,c}. It indicates that this pattern appears in four transactions, that the smallest and largest periods of this pattern are respectively 1 and 2 transactions, and that this pattern has an average periodicity of 1.4 transactions.
SPMF also offers many other algorithms for periodic pattern mining. This includes algorithms for many variations of the problem of periodic pattern mining.
Related problems
There also exist extensions of the problem of discovering periodic patterns. For example, another algorithm offered in the SPMF library is called PHM. It is designed to discover “periodic high-utility itemsets” in customer transaction databases. The goal is not only to find patterns that appear periodically, but also to discover the patterns that yield a high profit in terms of sales.
Another related problem is to find stable periodic patterns. A periodic pattern is said to be stable if its periods are generally more or less the same over time. The SPP-Growth and TSPIN algorithms were designed for this problem.
Another related problem is to find periodic patterns in multiple sequences of transactions instead of a single one. For example, one may want to find patterns that are periodic not only for one customer but periodic for many customers in a store.
Another topic is to find patterns that are locally periodic. This means that a pattern may not always be periodic. Thus the goal is to find patterns and the intervals of time where they are periodic.
If you want to know more about periodic pattern mining, you can watch some easy-to-understand video lectures that I have recorded on this topic. They are available on my website:
Also, you can find more videos on my YouTube channel. And if you want to know more about pattern mining in general, you can also check my free online pattern mining course, which covers many other related topics.
Conclusion
In this blog post, I have introduced the problem of discovering periodic patterns in databases. I have also explained how to use open-source software to discover periodic patterns. Mining periodic patterns is a general problem that may have many applications.
There are also many research opportunities related to periodic patterns, as this subject has not been extensively studied. For example, one possibility would be to transform the proposed algorithms into incremental or fuzzy algorithms, or to discover more complex types of periodic patterns such as periodic sequential rules.
Also, in this blog post, I have not discussed time series (sequences of numbers). Discovering periodic patterns in time series is another interesting problem, which requires different types of algorithms.