Tutorial: Discovering hidden patterns in texts using SPMF

This tutorial explains how to analyze text documents to discover complex and hidden relationships between words. We will illustrate this with a Sherlock Holmes novel. Moreover, we will explain how hidden patterns in text can be used to recognize the author of a text.

The Java open-source SPMF data mining library will be used in this tutorial. It is a library designed to discover patterns in various types of data, including sequences, and it can also be used as a standalone program. Handling text documents is a new feature of the most recent release of SPMF (v2.01).

Obtaining a text document to analyze

The first step of this tutorial is to obtain a text document to analyze. A simple way of obtaining text documents is to visit the website of Project Gutenberg, which offers numerous public domain books. I have chosen the novel "Sherlock Holmes: The Hound of the Baskervilles" by Arthur Conan Doyle. For the purpose of this tutorial, the book can be downloaded here as a single text file: SHERLOCK.text. Note that I have quickly edited the book to remove unnecessary information, such as the table of contents, which is not relevant for our analysis. Moreover, I have renamed the file with the extension ".text" so that SPMF recognizes it as a text document.

Downloading and running the SPMF software

The first task that we will perform is to find the most frequent sequences of words in the text. We will first download the SPMF software from the download page of the SPMF website. That page gives detailed instructions explaining how the software can be installed. But for the purpose of this tutorial, we will simply download spmf.jar, the version of the library that can be used as a standalone program with a graphical user interface.

Now, assuming that you have Java installed on your computer, you can double-click on spmf.jar to launch the software. This will open the main window of the software.

Discovering hidden patterns in the text document

Now, we will use the software to discover hidden patterns in the Sherlock Holmes novel. Many algorithms could be applied to find patterns. We will choose the TKS algorithm, which finds the k most frequent subsequences in a set of sequences. In our case, each sequence is a sentence. Thus, we will find the k most frequent sequences of words in the novel. Technically, this type of pattern is called a skip-gram: the words must appear in order, but gaps between them are allowed. Discovering the most frequent skip-grams is done as follows.

A) Finding the K most frequent sequences of words (skip-grams)

  1. We will choose the TKS algorithm
  2. We will choose the file SHERLOCK.text as input file
  3. We will enter the name test.txt as the output file for storing the results
  4. We will set the parameter of this algorithm to 10 to find the 10 most frequent sequences of words.
  5. We will click the “Run algorithm” button.

The result is a text file containing the 10 most frequent patterns.

For example, the first line indicates that the word "the" is followed by "of" in 762 sentences of the novel. The second line indicates that the word "in" appears in 773 sentences. The third line indicates that the word "the" is followed by "the" in 869 sentences. And so on. These patterns are quite short, so we will next change the parameters to find consecutive sequences of words.
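To make these support counts concrete, here is a small illustrative Python sketch (not part of SPMF, and far simpler than what TKS actually does) that counts in how many sentences a given sequence of words appears in order, possibly with gaps between the words:

```python
# Support of a skip-gram: the number of sentences in which the words
# of the pattern appear in the given order, possibly with gaps.
def skipgram_support(pattern, sentences):
    count = 0
    for sentence in sentences:
        remaining = iter(sentence.lower().split())
        # 'word in remaining' advances the iterator, so matches must occur in order
        if all(word in remaining for word in pattern):
            count += 1
    return count

sentences = [
    "The hound of the Baskervilles",
    "He ran in front of us",
    "The gate of the garden",
]
print(skipgram_support(["the", "of"], sentences))          # 2
print(skipgram_support(["in", "front", "of"], sentences))  # 1
```

With such a support function, finding the k most frequent skip-grams would amount to enumerating candidate patterns; this is exactly the search that TKS performs, but efficiently.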

B) Finding the K most frequent consecutive sequences of words (ngrams)

The above patterns are not very interesting because most of them are very short. To find more interesting patterns, we will set the minimum length of the patterns found to 4 words. Another problem is that some patterns, such as "the the", contain gaps between words. Thus, we will also specify that we do not want gaps between words by setting the max gap constraint to 1. Finally, we will increase the number of patterns to 25. This is done as follows:

  1. We set the number of patterns to 25
  2. We set the minimum length of patterns to 4 words
  3. We require that there is no gap between words (max gap = 1)
  4. We will click the “Run algorithm” button.

The results are much more interesting. They reveal some sequences of words that the author of Sherlock Holmes tends to use repeatedly. The most frequent 4-word sequence is "in front of us", which appears 13 times in this story.
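Setting the max gap to 1 means that the words of a pattern must be consecutive in a sentence, i.e., the patterns become n-grams. As a rough illustration (a naive Python sketch, not the SPMF implementation), the support of an n-gram can be counted as follows:

```python
# Support of an n-gram: the number of sentences containing the pattern
# as consecutive words (max gap = 1 in SPMF's terminology).
def ngram_support(pattern, sentences):
    n = len(pattern)
    count = 0
    for sentence in sentences:
        words = sentence.lower().split()
        # slide a window of n consecutive words over the sentence
        if any(words[i:i + n] == pattern for i in range(len(words) - n + 1)):
            count += 1
    return count

sentences = [
    "It was in front of us",
    "She stood in front of the door",
    "In the front room",
]
print(ngram_support(["in", "front", "of"], sentences))  # 2
```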

It is possible to further adjust the parameters to find other types of patterns. For example, using SPMF, it is also possible to find all patterns having a frequency higher than a given threshold. This can be done with the CM-SPAM algorithm. Let's try it.

C) Finding all sequences of words appearing frequently

We will use the CM-SPAM algorithm to find all patterns of at least 2 words that appear in at least 1 % of the sentences in the text. This is done as follows:

  1. We choose the CM-SPAM algorithm
  2. We set the minimum frequency to 1 % of the sentences in the text
  3. We require that patterns contain at least two words
  4. We will click the “Run algorithm” button.

The result is 113 patterns, some of which are quite interesting. For example, we can see that "Sherlock Holmes" is a frequent pattern, appearing 31 times in the text, and that "sir Charles" is actually more frequent than "Sherlock Holmes". Other patterns are also interesting and give some insight into the writing habits of the author of this novel.
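The minimum frequency here is relative: with a 1% threshold and, say, 5,000 sentences, a pattern must appear in at least 50 sentences to be kept. The following toy Python sketch (a naive illustration only; CM-SPAM uses a much more efficient search) shows the idea for two-word patterns:

```python
# Naive frequent-pattern search: keep each ordered pair of words that
# appears (in that order) in at least minsup_ratio of the sentences.
import math
from itertools import combinations

def frequent_pairs(sentences, minsup_ratio):
    minsup = math.ceil(minsup_ratio * len(sentences))  # absolute threshold
    counts = {}
    for sentence in sentences:
        words = sentence.lower().split()
        # count each ordered pair at most once per sentence
        pairs = {(words[i], words[j]) for i, j in combinations(range(len(words)), 2)}
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return {p: c for p, c in counts.items() if c >= minsup}

sentences = ["sherlock holmes smiled", "said sherlock holmes", "sir charles died"]
print(frequent_pairs(sentences, 0.5))  # {('sherlock', 'holmes'): 2}
```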

Now let’s try another type of patterns.

D) Finding sequential rules  between words 

We will now try to find sequential rules. A sequential rule X -> Y is a sequential relationship between two unordered sets of words appearing in the same sentence: when the words of X appear, they tend to be followed by the words of Y. For example, we can apply the ERMiner algorithm to discover sequential rules between words and see what kind of results can be obtained. This is done as follows.

  1. We choose the ERMiner algorithm
  2. We set the minimum frequency to 1 % of the sentences in the text
  3. We require that rules have a confidence of at least 80% (a rule X -> Y has a confidence of 80% if the unordered set of words X is followed by the unordered set of words Y at least 80% of the time when X appears in a sentence)
  4. We will click the “Run algorithm” button.

The result is a set of three rules.

The first rule indicates that 96% of the time, when "Sherlock" appears in a sentence, it is followed by "Holmes", and that "Sherlock Holmes" appears a total of 31 times in the text.
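The confidence of a rule can be understood with a small Python sketch (an illustrative simplification, not the ERMiner implementation): among the sentences containing all words of X, it is the fraction in which all words of Y appear after them:

```python
# Confidence of a sequential rule X -> Y over a set of sentences.
def rule_confidence(x, y, sentences):
    sup_x = sup_xy = 0
    for sentence in sentences:
        words = sentence.lower().split()
        if not all(word in words for word in x):
            continue  # some word of X is missing from this sentence
        sup_x += 1
        # earliest position by which every word of X has appeared
        cutoff = max(words.index(word) for word in x)
        tail = words[cutoff + 1:]
        if all(word in tail for word in y):
            sup_xy += 1  # X is followed by Y in this sentence
    return sup_xy / sup_x if sup_x else 0.0

sentences = ["sherlock holmes said", "sherlock holmes smiled", "sherlock left alone"]
print(rule_confidence(["sherlock"], ["holmes"], sentences))  # 2/3
```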

For this example, I have chosen the parameters so as not to obtain too many rules. But it is possible to change the parameters to obtain more rules, for example by lowering the minimum confidence requirement.

Applications of discovering patterns in texts

Here, we have shown how various types of patterns can be easily extracted from text files using the SPMF software. The goal was to give an overview of some of the possibilities. SPMF also offers other algorithms, which could be used to find yet other kinds of patterns.

Now let's talk about the applications of finding patterns in text. One popular application is called "authorship attribution". It consists of extracting patterns from a text to learn about the writing style of an author. The patterns can then be used to automatically guess the author of an anonymous text.

For example, if we have a set of texts written by various authors, we can extract the most frequent patterns in each text to build a signature representing each author's writing style. Then, to guess the author of an anonymous text, we can compare the patterns found in the anonymous text with the signature of each author to find the most similar one. Several papers have been published on this topic. Besides using words for authorship attribution, it is also possible to analyze the part-of-speech tags in a text. This requires first transforming the text into sequences of part-of-speech tags. I will not show how to do this in this tutorial, but it is the topic of a few papers that I have recently published with my student, also using the SPMF software. If you are curious and want to know more about this, you may have a look at the following paper:

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91.
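To give an idea of how pattern-based authorship attribution can work, here is a toy Python sketch. All the names and patterns below are made up for illustration; real systems, such as the one in the paper above, use part-of-speech skip-grams and more refined similarity measures:

```python
# Toy authorship attribution: each author's signature is a set of
# frequent patterns mined from their known texts; an anonymous text is
# attributed to the author with the most similar signature (Jaccard).
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def guess_author(anonymous_patterns, signatures):
    return max(signatures,
               key=lambda author: jaccard(anonymous_patterns, signatures[author]))

signatures = {
    "Doyle":  {("in", "front", "of", "us"), ("said", "sherlock", "holmes")},
    "Austen": {("i", "am", "sure"), ("in", "the", "world")},
}
anonymous = {("in", "front", "of", "us"), ("upon", "the", "moor")}
print(guess_author(anonymous, signatures))  # Doyle
```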

There are also other possible applications of finding patterns in text, such as plagiarism detection.

Conclusion

In this blog post, I have shown how open-source software can be used to easily find patterns in text. The SPMF library can be used as a standalone program or can be called from other Java programs. It offers many algorithms with several parameters to find various types of patterns. I hope that you have enjoyed this tutorial. If you have any comments, please leave them below.

==
Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Plagiarism by Kemal Koche and Shweta Dubey from J. D. College of Engineering & Management (Nagpur, India)

This is just a quick blog post to mention that I have found another serious case of plagiarism in a journal. Some "researchers" from India have plagiarized my ICDM 2011 paper about closed high utility itemsets. As usual, when I find a new case of plagiarism, I first contact the editor of the journal to let them know, so that the paper can be retracted. Moreover, I also post it on this blog so that people can know that these people have plagiarized my paper.

First plagiarized paper by Shweta Dubey and Kemal Koche

“Mining of High Utility Itemsets Efficiently using AprioriHC, AprioriHC-D and d2HUP Algorithms without Applicant”
by Shweta Dubey ( mailshwetadubey@gmail.com ) and Kemal Koche
from the
J. D. College of Engineering & Management Nagpur, India
( link: http://www.aijet.in/v3/1606013.pdf   ).

Plagiarism by Shweta Dubey and Kemal Koche

This paper was published in a journal called "An International Journal of Engineering & Technology" (AIJET) ( http://www.aijet.in/ ), which perhaps does not peer review its papers, because the plagiarized paper is very badly written and does not make much sense, as the authors have plagiarized parts of other papers as well. It is interesting that the authors copied a lot of content from my paper, and they even took screenshots of the math formulas in my paper, perhaps because they do not know how to type math formulas in Microsoft Word (!):

Plagiarism by using screenshots of the definitions

I will also send an e-mail to the head of their department at J. D. College of Engineering & Management Nagpur to notify them about this.

Second plagiarized paper by Shweta Dubey and Kemal Koche

Moreover, I have found that the same persons have published another plagiarized paper.

A SURVEY PAPER ON HIGH UTILITY ITEMSETS MINING
Mrs. Shweta A. Dubey*, Prof. Kemal. Koche * M.tech Student: Dept. of Computer Sci. and Engg. J. D. College of Engineering and Management, Nagpur, Maharashtra, India
Dept. of Computer Sci. and Engg.J. D. College of Engineering and Management, Nagpur, Maharashtra, India

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY (IJESRT)

Another plagiarized paper by Shweta Dubey and Kemal Koche from the J. D. College of Engineering

Plagiarism by Kemal Koche

In this paper, the authors again copy some content from my ICDM 2011 paper. Moreover, they even copy some text from the SPMF website about how to use the SPMF software, and they claim that this is a new algorithm. Again, the paper does not make much sense, which indicates that this journal, called IJESRT, probably does not even review the papers that it publishes.

Who are the authors of these plagiarized papers?

The second author, Kemal Koche, appears to be a professor at J. D. College of Engineering. Here is a picture from his ResearchGate profile:

Kemal U Koche from JD College of Engineering, Nagpur India

His picture can also be found on the webpage of his department ( http://www.jdcoem.ac.in/faculty.php?dept_id=11 ):

Kemal Koche

Conclusion

This is not the first time that researchers have plagiarized my papers. It is at least the fifth time that people from India have plagiarized my papers. Besides, in the past, two people from France have plagiarized my papers, as well as two people from Algeria.

Anyway, I just wanted to post about this so that people who search for the names of these people can find that they have plagiarized my papers. If there is any news about this, I will update the blog post.


Brief report about the Dexa 2016 and Dawak 2016 conferences

This week, I attended the DEXA 2016 and DAWAK 2016 conferences in Porto, Portugal, from the 4th to the 8th of September 2016, to present three papers. In this blog post, I will give a brief report about these conferences.

About these conferences

The DEXA conference is a well-established international conference on database and expert systems. This year was the 27th edition of the conference. It is held in Europe every year, but still attracts a considerable number of researchers from other continents.

DEXA is a kind of multi-conference. It actually consists of six smaller conferences that are organized together. Below, I provide a description of each of these sub-conferences and indicate their acceptance rates.

  • DEXA 2016 (27th Intern. Conf. on Database and Expert Systems Applications)
    Acceptance rate: 39 / 137 = 28% for full papers, plus another 29 / 137 = 21% for short papers
  • DaWaK 2016 (18th Intern. Conf. on Big Data Analytics and Knowledge Discovery)
    Acceptance rate: 25 / 73 = 34%
  • EGOVIS 2016 (5th Intern. Conf. on Electronic Government and the Information Systems Perspective)
    Acceptance rate: not disclosed in the proceedings; 22 papers published
  • ITBAM 2016 (7th Intern. Conf. on Information Technology in Bio- and Medical Informatics)
    Acceptance rate: 9 / 26 = 36% for full papers, plus another 11 / 26 = 42% for short papers
  • TrustBus 2016 (13th Intern. Conf. on Trust, Privacy, and Security in Digital Business)
    Acceptance rate: 25 / 73 = 43%

Thus, the DEXA conference is more popular than DaWaK and the other sub-conferences, and is also more competitive in terms of acceptance rate.

Proceedings

The proceedings of each of the first five sub-conferences are published by Springer in the Lecture Notes in Computer Science series, which is quite nice, as it ensures that the papers are indexed by the major indexes in computer science, such as DBLP. The proceedings were given to attendees on a USB drive.

badge and proceedings of DEXA

The conference location

The conference was locally organized by the Instituto Superior de Engenharia do Porto (ISEP) in Porto, Portugal. The location was great, as Porto is a beautiful European city with a long history. The old town of Porto is especially beautiful. Moreover, visiting Porto is quite inexpensive.

First day of the conference

The first day of the conference started at 10:30 AM and consisted mostly of paper presentations. The main topics of the papers on the first day of DEXA were temporal databases, high-utility itemset mining, periodic pattern mining, privacy-preserving data mining, and clustering. In particular, I gave two paper presentations related to itemset mining:

  • a paper presenting a new type of patterns called minimal high-utility itemsets
  • a paper about discovering high utility itemsets with multiple thresholds.

Besides, there was a keynote entitled "From Natural Language to Automated Reasoning" by Bruno Buchberger from Austria, a famous researcher in the field of symbolic computation. The keynote was about using formal automated reasoners (e.g., math theorem provers) based on logic to analyze texts. For example, the speaker proposed to extract formal logic formulas from tweets and then understand their meaning using automated reasoners and a knowledge base provided by the user. This was a quite unusual perspective on tweet analysis since, nowadays, researchers in natural language processing prefer statistical approaches for analyzing texts over approaches relying on logic and a knowledge base. This gave rise to some discussion during the question period after the keynote.

DEXA 2016 first keynote speech

In the evening, there was also a reception in a garden inside the institute where the conference was held.

DEXA 2016 reception

Second day of the conference

On the second day, I attended DAWAK. In the morning, there were several paper presentations. I presented a paper about recent high-utility itemset mining. The idea is to discover itemsets (sets of items) that have recently been profitable in customer transactions, and then use this insight for marketing decisions. There was also an interesting paper presentation about big data itemset mining by Martin Kirchgessner, a student from France.

Then, there was an interesting keynote about the price of data by Gottfried Vossen from Germany. The talk started by discussing the fact that companies are collecting more and more rich data about people. Many people give away personal data for free to use services such as Facebook or Gmail. There also exist several marketplaces where companies can buy data, such as the Microsoft Azure Marketplace, as well as different pricing models for data. For example, one could sell more or less detailed views of the same data at different prices. There are also repositories of public data. Another issue is what happens to someone's data when they die. In the future, a new way of buying products could be to pay for data about the design of an object, and then print it ourselves using 3D printers or other tools. Other issues related to the sale of data are DRM, selling second-hand data, etc. Overall, it was not a technical presentation, but it discussed an important topic nowadays: the value of data in a society that relies more and more on technology.

Third day of the conference

On the third day of the conference, there were more paper presentations and also a keynote that I missed. In the evening, there was a nice banquet at a wine cellar named Taylor. We had the pleasure of visiting the cellar, enjoying a nice dinner, and listening to a Portuguese band at the end of the evening.

Conclusion

Overall, this was a very interesting conference. I had the opportunity to discuss with some excellent researchers, especially from Europe, including some that I had met at other conferences. There were also some papers quite related to my sub-field of research in data mining. DEXA may not be a first-tier conference, but it is a very decent conference, and I would submit more papers to it in the future.

DEXA 2017 / DAWAK 2017 will be held in Lyon, France.


Brief report about the IEA AIE 2016 conference

This week, I attended the IEA AIE 2016 conference, held in Morioka, Japan, from the 2nd to the 4th of August 2016. In this blog post, I will briefly discuss the conference.

About the conference

IEA AIE 2016 (29th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems) is an artificial intelligence conference with a focus on applications of artificial intelligence. But this conference also accepts theoretical papers, as well as data science papers. This year, 85 papers were accepted. The proceedings were given to attendees on a USB drive.

There was also the option to buy the printed proceedings, though:

IEA AIE 2016 proceedings

First day of the conference

Two keynote speeches were given on the first day of the conference. For the second keynote, about 50 people were in attendance. Many attendees came from Asia, which is understandable since the conference was held in Japan.

The paper presentations on the first day were about topics such as Knowledge-based Systems, the Semantic Web, Social Networks (clustering, relationship prediction), Neural Networks, Evolutionary Algorithms and Heuristic Search, Computer Vision, and Adaptive Control.

Second day

On the second day of the conference, there was a great keynote talk by Prof. Jie Lu from Australia about recommender systems. She first introduced the main recommendation approaches (content-based filtering, collaborative filtering, knowledge-based recommendation, and hybrid approaches) and some of the typical problems that recommender systems face (the cold-start problem, the sparsity problem, etc.). She then talked about her recent work on extensions of the recommendation problem such as group recommendation (e.g., recommending a restaurant that will globally satisfy, or at least not disappoint, a group of people), trust-based recommendation (e.g., a system that recommends products to you based on what friends that you trust, or the friends of your friends, have liked), fuzzy recommender systems (recommender systems that consider that each item can belong to more than one category), and cross-domain recommendation (e.g., if you like reading books about Kung Fu, you may also like watching movies about Kung Fu).

After that, there were several paper presentations. The main topics were the Semantic Web, Social Networks, Data Science, Neural Networks, Evolutionary Algorithms, Heuristic Search, Soft Computing, and Multi-Agent Systems.

In the evening, there was a great banquet at the Morioka Grand Hotel, a nice hotel located on a mountain, on the outskirts of the city.

Moreover, during the dinner, there was a live music band.

Third day

On the third day, there was an interesting keynote on robotics by Prof. Hiroshi Okuno from Waseda University / the University of Tokyo. His team has developed an open-source software called HARK for robot audition. Robot audition is the general process by which a robot processes sounds from its environment. The software, the result of many years of research, has been used in robots equipped with arrays of microphones. By using the HARK library, robots can listen to multiple people talking to them at the same time, localize where sounds come from, and isolate sounds, among many other capabilities.

The keynote was followed by paper presentations on topics such as Data Science (KNN, SVM, itemset mining, clustering), Decision Support Systems, Medical Diagnosis and Bio-informatics, Natural Language Processing, and Sentiment Analysis. I presented a paper about a new high-utility itemset mining algorithm named FHM+.

Location

The conference was located in Morioka, a relatively small city in Japan. However, the timing of the conference was perfect: it was held during the Sansa Odori festival, one of the most famous festivals in Japan. Thus, during the evenings, it was possible to watch the Sansa parade, where people wearing traditional costumes were playing taiko drums and dancing in the streets.

Conclusion

The conference was quite interesting. Since it is a fairly general conference, I did not discuss with many people close to my research area, but I met some interesting people, including some top researchers. During the conference, it was announced that IEA AIE 2017 will be held in Arras (close to Paris, France).


How to give a good paper presentation at an academic conference?

In this blog post, I will discuss how to give a good paper presentation at an academic conference. If you are a researcher, this is an important topic, because giving a good presentation of your work will raise interest in it. In fact, a good researcher should not only be good at doing research, but also at communicating the results of that research, both in writing and orally.

Rule 1: Prepare yourself, and know the requirements

Giving a good oral presentation starts with good preparation. One should not prepare a presentation the day before, but several days in advance, to make sure that there is enough time to prepare well. A common mistake is to prepare a presentation the evening before. In that case, the presenter may finish late, not sleep well, be tired, and as a result give a poor presentation.

Preparing a presentation does not only mean designing some PowerPoint slides. It also means practicing your presentation. If you are going to give a paper presentation at a conference, you should ideally practice giving it several times, alone or in front of friends, before giving it in front of your audience. You will then be better prepared, feel less nervous, and give a better presentation.

It is also important to understand the context of your presentation: (1) Who will attend the presentation? (2) How long should the presentation be? (3) What kind of equipment will be available (projector, computer, room, etc.)? (4) What is the background of the people attending? These general questions need to be answered to help you prepare an oral presentation.

Who will attend the presentation is important. A presentation given in front of experts from your field should be different from one given to your friends, your research advisor, or some kids. A presentation should always be adapted to the audience.

To avoid bad surprises, it is always better to check the equipment that will be available for your presentation and to prepare a backup plan in case problems occur. For example, one may bring a laptop and a copy of the presentation on a USB drive, as well as a copy in an e-mail inbox, just in case.

It is also important to know the expected length of the presentation. If a conference presentation should last no more than 20 minutes, for example, then one should make sure that it does not. At an academic conference, it is quite possible that someone will stop your presentation if you exceed the time limit. Moreover, exceeding the time limit may be seen as disrespectful.

Rule 2: Always look at your audience

When giving a presentation, there are a few important rules that should always be followed. One of the most important is to always look at your audience when talking.

One should NEVER read the slides or turn one's back to the audience for more than a few seconds. I have seen presenters at academic conferences who did not look at the audience for long periods of time, and it is one of the best ways to annoy the audience. For example, here are some pictures that I took at an academic conference.

Not looking at the audience

Turning your back to audience

In that presentation, the presenter barely looked at the audience. He was either looking at the floor while talking (first picture) or reading the slides (second picture). This is one of the worst things to do, and the presentation was in fact awful; not because the research work was not good, but because it was poorly presented. To give a good presentation, one should look at the audience as much as possible. It is OK to sometimes look at a slide to point something out, but not for more than a few seconds. Otherwise, the audience may lose interest in your presentation.

Personally, when I give a presentation, I look very quickly at the computer screen to see some keywords and remember what I should say, and then I look at the audience while talking. When I move to the next slide, I briefly look at the screen again to remember what to say, and then I continue looking at the audience. Doing this results in a much better presentation, but it may require some preparation. If you practice giving your talk several times beforehand, you will become more natural and will not need to read your slides when it is time to present in front of an audience.

Rule 3:  Talk loud enough

Another important rule is to talk LOUD enough when giving a presentation, and to speak clearly. Make sure that even the people at the back of the room can hear you clearly. This seems obvious, but at academic conferences there are often presenters who do not speak loud enough, and it becomes very tedious for the audience, especially for those at the back of the room.

Rule 4:  Do not sit 

Another good piece of advice is to stand when giving a presentation. I have seen some people give a presentation while seated. In general, if you are seated, you will be less dynamic. It is always better to stand up to give a presentation.

Rule 5:  Make simple slides

A very common problem that I have observed at academic conferences is that presenters put way too much content on their slides. Here are some pictures that I took at an academic conference, as examples:

Slides with too many formulas

In this picture, the problem is that there are too many technical details and formulas. It is impossible for someone attending a 20-minute presentation with slides full of formulas to read, understand, and remember all these formulas and symbols.

In general, when I give a presentation at a conference, I do not show all the details, formulas, or theorems. Instead, I give only the minimum details needed for the audience to understand the basic idea of my work: what the problem is that I want to solve, and the main intuition behind the solution. I also try to explain some applications of my work and show some illustrations or simple examples to make it easy to understand. After all, the goal of a paper presentation is that the audience understands the main idea of your work. Then, if someone in the audience wants to know all the technical details, they can read your paper.

If someone does as in the picture above, giving way too many technical details or showing a lot of formulas during a presentation, then the audience will very quickly get lost in the details and stop following the presentation.

Here is another example:

A slide with way too much content

In the above slide, there is way too much text. Nobody in the audience will read all of it. To make a good presentation, you should make your slides as simple as possible. You should also not put full sentences, but rather just keywords or short fragments of sentences. The reason is that during the presentation you should not read the slides, and the audience should not be reading long text on your slides either. You should talk, and the audience should listen to you rather than read your slides. Here is an example of a good slide design:

A powerpoint slide with just enough content

This slide has just enough content. It has some very short text that gives only the main points. The presenter can then talk to explain these points in more detail while looking at the audience rather than reading the slides.

Conclusion

A lot more could be said about giving a good presentation, but I did not want to write too much today. Actually, giving good presentations is a skill that is learned through practice. The more you practice giving presentations, the more comfortable you will become talking in front of many people.

Also, it is quite normal for a student to be nervous when giving a presentation, especially in a foreign language. In that case, more preparation is required.

I hope that you have enjoyed this short blog post.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Conference, General, Research | 2 Comments

Brief report about the 12th International Conference on Machine Learning and Data Mining conference (MLDM 2016)

In this blog post, I will provide a brief report about the 12th International Conference on Machine Learning and Data Mining (MLDM 2016), which I attended from the 18th to the 20th of July 2016 in New York, USA.

About the conference
This is the 12th edition of the conference. The MLDM conference is co-located and co-organized with the 16th Industrial Conference on Data Mining 2016, which I also attended this week. The proceedings of MLDM are published by Springer. Moreover, an extra book containing two late papers was offered, published by Ibai solutions.

The MLDM 2016 proceedings

Acceptance rate

The acceptance rate of the conference is about 33% (58 papers have been accepted from 169 submitted papers), which is reasonable.

First day of the conference

The first day of the MLDM conference started at 9:00 with an opening ceremony, followed by a keynote on supervised clustering. The idea of supervised clustering is to perform clustering on data that already has some class labels. Thus, it can be used, for example, to discover sub-classes within existing classes. The class labels can also be used to evaluate how good the clusters are. One of the cluster evaluation measures suggested by the keynote speaker is purity, that is, the percentage of instances in a cluster having the cluster's most popular class label. Among other applications, the purity measure can be used to remove outliers from clusters.
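As a side note, the purity measure described above is simple to compute. The following Java sketch (my own illustration of the definition, not the keynote speaker's code) computes the purity of a cluster from the class labels of its instances:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class PurityExample {
    // Purity of a cluster: the fraction of its instances that have
    // the most frequent (majority) class label in that cluster.
    static double purity(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int majority = Collections.max(counts.values());
        return (double) majority / labels.length;
    }

    public static void main(String[] args) {
        // A cluster with three instances of class A and one of class B
        System.out.println(purity(new String[]{"A", "A", "A", "B"})); // 0.75
    }
}
```

A purity close to 1 means the cluster is dominated by a single class, which is why low-purity instances can be treated as outlier candidates.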

After the keynote, there were paper presentations for the rest of the day. Topics were quite varied, including clustering, support vector machines, stock market prediction, list price optimization, image processing, automatic authorship attribution of texts, driving style identification, and source code mining.

The conference room

The conference ended at around 17:00 and was followed by a banquet at 18:00. There were about 40 people attending the conference in the morning. Overall, there were some interesting paper presentations and discussions.

Second day of the conference

The second day was also a day of paper presentations.

Second day of the conference (afternoon)

The topics of the second day included itemset mining algorithms, inferring geo-information about persons, multigroup regression, analyzing the content of videos, time-series classification, gesture recognition (a presentation by Intel) and analyzing the evolution of communities in social networks.

I presented two papers that day (one of mine and one by a colleague), including a paper about high-utility itemset mining.

Third day of the conference

The third day of the conference also consisted of paper presentations. There were various topics such as image classification, image enhancement, mining patterns in cellular radio access network data, random forest learning, clustering, and graph mining.

Conclusion

Overall, it was an interesting conference. I attended both the Industrial Conference on Data Mining and the MLDM conference this week. MLDM is more focused on theory, while the Industrial Conference on Data Mining is more focused on industrial applications. MLDM is a slightly bigger conference.


Posted in Big data, Conference, Data Mining, Data science | 1 Comment

Brief report about the 16th Industrial Conference on Data Mining 2016 (ICDM 2016)

In this blog post, I will provide a brief report about the 16th Industrial Conference on Data Mining 2016, which I attended from the 13th to the 14th of July 2016 in New York, USA.

About the conference
This conference is an established conference in the field of data mining, now at its 16th edition. The main proceedings are published by Springer, which ensures that they are well-indexed, while the poster proceedings are published by Ibai solutions.

The proceedings for full papers and posters

The acronym of this conference is ICDM (Industrial Conference on Data Mining). However, it should not be confused with IEEE ICDM (IEEE International Conference on Data Mining).

Acceptance rate

The quality of the papers at this conference was fine. 33 papers were accepted, and it was said during the conference that the acceptance rate was 33%. Interestingly, this conference attracts many papers related to industry applications due to its name. There were some paper presentations from IBM Research and Mitsubishi.

Conference location

In the past, this conference has mostly been held in Germany, as the organizer is German. But this year, it was organized in New York, USA, more specifically at a hotel in Newark, New Jersey, beside New York. It is interesting that the conference is co-located and co-organized with the MLDM conference. Hence, it is possible to spend 9 days in Newark to attend both conferences. You can read my report about MLDM 2016 here: MLDM 2016.

First day of the conference

The conference lasts two days (excluding the workshops and tutorials, which require additional fees). The first day was scheduled to start at 9:00 and finish at around 18:00.

The conference room

The conference started with an opening ceremony. After the opening ceremony, there was supposed to be a keynote, but the keynote speaker unfortunately could not attend the conference due to personal reasons.

Then, there were paper presentations. The main topics of the papers were applications in medicine, algorithms for itemset mining, applying clustering to analyze data about divers, text mining, prediction of rainfall using deep learning, analyzing GPS traces of truck drivers, and preventing network attacks in the cloud using classifiers. I presented a paper about high-utility itemset mining and periodic pattern mining.

After the paper presentations, there was a poster session. One of the posters applied data mining to archaeology, which is quite an unusual application of data mining.

The poster session

Finally, there was a banquet at the same hotel. The food was fine. Since the conference is quite small, there were only three tables of 10 people. But a small conference also means a more welcoming atmosphere and more proximity between participants, so it was easy to talk with everyone.

Second day of the conference

The second day of the conference started at 9:00. It consisted of paper presentations covering topics such as association rules, alarm monitoring, distributed data mining, process mining, diabetes age prediction, and data mining to analyze wine ratings.

The second day

Best paper award

One thing that I think is great about this conference is that there is a 500 € prize for the best paper. Not many conferences offer a cash prize for the best paper award. Also, three papers were nominated for the best paper award, and each nominee received a certificate. As one of my papers was nominated, I got a nice certificate (though I did not receive the best paper award).

Best paper nomination certificate

Conclusion

This was the first time I attended this conference. It is small compared to some other conferences, and it is not a first-tier conference in data mining, but its proceedings are still published by Springer. At this conference, I did not meet anyone working exactly in my field, but I still had the opportunity to meet some interesting researchers. The MLDM conference, which I attended after this one, was bigger.


Posted in Big data, Conference, Data Mining | 3 Comments

The top journals and conferences in data mining / data science

A key question for data mining and data science researchers is what the top journals and conferences in the field are, since it is always best to publish in the most popular journals or conferences. In this blog post, I will look at four different rankings of data mining journals and conferences based on different criteria, and discuss these rankings.

1) The Google Ranking of data mining and analysis journals and conferences

A first ranking is the Google Scholar ranking (https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis). This ranking is automatically generated based on the h5-index measure, which is described as “the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2011-2015 have at least h citations each”. The ranking of the top 20 conferences and journals is the following:

  Publication h5-index h5-median Type
1 ACM SIGKDD International Conference on Knowledge discovery and data mining 67 98 Conference
2 IEEE Transactions on Knowledge and Data Engineering 66 111 Journal
3 ACM International Conference on Web Search and Data Mining 58 94 Conference
4 IEEE International Conference on Data Mining (ICDM) 39 64 Conference
5 Knowledge and Information Systems (KAIS) 38 52 Journal
6 ACM Transactions on Intelligent Systems and Technology (TIST) 37 68 Journal
7 ACM Conference on Recommender Systems 35 64 Conference
8 SIAM International Conference on Data Mining (SDM) 35 54 Conference
9 Data Mining and Knowledge Discovery 33 57 Journal
10 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 30 56 Journal
11 European Conference on Machine Learning and Knowledge Discovery in Databases (PKDD) 30 36 Conference
12 Social Network Analysis and Mining 26 37 Journal
13 ACM Transactions on Knowledge Discovery from Data (TKDD) 23 39 Journal
14 International Conference on Artificial Intelligence and Statistics 23 29 Conference
15 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 22 30 Conference
16 IEEE International Conference on Big Data 18 30 Conference
17 Advances in Data Analysis and Classification 18 25 Journal
18 Statistical Analysis and Data Mining 17 30 Journal
19 BioData Mining 17 25 Journal
20 Intelligent Data Analysis 16 21 Journal

 

Some interesting observations can be made from this ranking:

  • It shows that some conferences in the field of data mining actually have a higher impact than some journals. For example, the well-known KDD conference is ranked higher than all journals.
  • It appears strange that the KAIS journal is ranked higher than the DMKD/DAMI and TKDD journals, which are often regarded as better journals than KAIS. However, it may be that the field is evolving and that KAIS has really improved over the years.
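As an aside, the h-index definition quoted above is easy to make concrete. Here is a small Java sketch (my own illustration, not code from Google Scholar) that computes the h-index of a list of citation counts:

```java
import java.util.Arrays;

public class HIndexExample {
    // h-index: the largest number h such that h articles have
    // at least h citations each.
    static int hIndex(int[] citations) {
        int[] sorted = citations.clone();
        Arrays.sort(sorted); // ascending order
        int n = sorted.length;
        int h = 0;
        // The i-th most cited paper (1-based rank) contributes to h
        // if it has at least i citations.
        for (int i = 0; i < n; i++) {
            int rank = i + 1;               // rank 1 = most cited paper
            if (sorted[n - 1 - i] >= rank) {
                h = rank;
            } else {
                break;
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // 4 papers with at least 4 citations each, but not 5 with at least 5
        System.out.println(hIndex(new int[]{10, 8, 5, 4, 3})); // 4
    }
}
```

The h5-index simply restricts the input to the articles a venue published in the last 5 complete years.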

2) The Microsoft Ranking of data mining conferences

Another automatically generated ranking is the Microsoft ranking of data mining conferences (http://academic.research.microsoft.com/RankList?entitytype=3&topdomainid=2&subdomainid=7&orderby=1). This ranking is based on the number of publications and citations. Contrary to Google, Microsoft has separate rankings for conferences and journals.

Besides, Microsoft offers two metrics for ranking: the number of citations and the Field Rating. It is not very clear how the “field rating” is calculated by Microsoft. The Microsoft help center describes it as follows: “The Field Rating is similar to h-index in that it calculates the number of publications by an author and the distribution of citations to the publications. Field rating only calculates publications and citations within a specific field and shows the impact of the scholar or journal within that specific field”.

Here is the conference ranking of the top 30 conferences by citations:

Rank Conference Publications Citations
1 KDD – Knowledge Discovery and Data Mining 2063 69270
2 ICDE – International Conference on Data Engineering 4012 67386
3 CIKM – International Conference on Information and Knowledge Management 2636 28621
4 ICDM – IEEE International Conference on Data Mining 2506 18362
5 SDM – SIAM International Conference on Data Mining 708 9095
6 PKDD – Principles of Data Mining and Knowledge Discovery 994 8875
7 PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining 1255 6400
8 DASFAA – Database Systems for Advanced Applications 1251 4001
9 RIAO – Recherche d’Information Assistee par Ordinateur 574 3551
10 DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery 103 3264
11 DaWaK – Data Warehousing and Knowledge Discovery 503 2648
12 DS – Discovery Science 553 2256
13 Fuzzy Systems and Knowledge Discovery 4626 2171
14 DOLAP – International Workshop on Data Warehousing and OLAP 177 1830
15 IDEAL – Intelligent Data Engineering and Automated Learning 1032 1789
16 WSDM – Web Search and Data Mining 196 1499
17 GRC – IEEE International Conference on Granular Computing 1351 1434
18 ICWSM – International Conference on Weblogs and Social Media 238 1142
19 DMDW – Design and Management of Data Warehouses 70 993
20 FIMI – Workshop on Frequent Itemset Mining Implementations 32 849
21 MLDM – Machine Learning and Data Mining in Pattern Recognition 313 822
22 PJW – Workshop on Persistence and Java 41 712
23 ADMA – Advanced Data Mining and Applications 562 676
24 ICETET – International Conference on Emerging Trends in Engineering & Technology 712 376
25 WKDD – Workshop on Knowledge Discovery and Data Mining 527 342
26 KDID – International Workshop on Knowledge Discovery in Inductive Databases 70 328
27 ICDM – Industrial Conference on Data Mining 304 323
28 DMIN – Int. Conf. on Data Mining 434 278
29 MineNet – Mining Network Data 22 278
30 WebMine – Workshop on Web Mining 15 245

And here are the top 30 conferences by Field Rating:

Rank Conference Publications Field Rating
1 KDD – Knowledge Discovery and Data Mining 2063 122
2 ICDE – International Conference on Data Engineering 4012 104
3 CIKM – International Conference on Information and Knowledge Management 2636 67
4 ICDM – IEEE International Conference on Data Mining 2506 56
5 SDM – SIAM International Conference on Data Mining 708 45
6 PKDD – Principles of Data Mining and Knowledge Discovery 994 40
7 PAKDD – Pacific-Asia Conference on Knowledge Discovery and Data Mining 1255 33
8 RIAO – Recherche d’Information Assistee par Ordinateur 574 28
9 DMKD / DAMI – Research Issues on Data Mining and Knowledge Discovery 103 27
10 DASFAA – Database Systems for Advanced Applications 1251 26
11 DaWaK – Data Warehousing and Knowledge Discovery 503 22
12 DOLAP – International Workshop on Data Warehousing and OLAP 177 22
13 DS – Discovery Science 553 20
14 ICWSM – International Conference on Weblogs and Social Media 238 19
15 WSDM – Web Search and Data Mining 196 19
16 DMDW – Design and Management of Data Warehouses 70 19
17 PJW – Workshop on Persistence and Java 41 16
18 FIMI – Workshop on Frequent Itemset Mining Implementations 32 14
19 GRC – IEEE International Conference on Granular Computing 1351 13
20 IDEAL – Intelligent Data Engineering and Automated Learning 1032 13
21 MLDM – Machine Learning and Data Mining in Pattern Recognition 313 13
22 Fuzzy Systems and Knowledge Discovery 4626 11
23 ADMA – Advanced Data Mining and Applications 562 10
24 KDID – International Workshop on Knowledge Discovery in Inductive Databases 70 10
25 ICDM – Industrial Conference on Data Mining 304 9
26 MineNet – Mining Network Data 22 9
27 ESF Exploratory Workshops 17 8
28 TSDM – Temporal, Spatial, and Spatio-Temporal Data Mining 13 8
29 ICETET – International Conference on Emerging Trends in Engineering & Technology 712 7
30 WKDD – Workshop on Knowledge Discovery and Data Mining 527 7

Some observations:

  • The rankings by citations and by field rating are quite similar.
  • The KDD conference is still the #1 conference. This makes sense, as does the fact that the CIKM, ICDM and SDM conferences are among the top conferences in the field.
  • PKDD is ranked higher than PAKDD, as in the Google ranking, and both are ranked higher than DASFAA and DaWaK. I agree with this.
  • Some conferences, like ICDE, were not in the Google ranking. It may be because the Google ranking puts the ICDE conference in a different category.
  • Microsoft ranks DMKD / DAMI as a conference, while it is a journal.
  • The FIMI workshop is also ranked high, although that workshop was only held in 2003 and 2004. Thus, it seems that the Microsoft ranking has no restriction on time. Actually, since the FIMI workshop has not been held since 2004, it should not be in this ranking. The ranking would probably be better if Microsoft considered only the last five years, for example.

3) The Microsoft Ranking of data mining journals

Now let’s look at the top 20 data mining journals according to Microsoft, ranked by citations.

Rank Journal Publications Citations
1 IPL – Information Processing Letters 7044 62746
 2 TKDE – IEEE Transactions on Knowledge and Data Engineering 2742 60945
3 CS&DA – Computational Statistics & Data Analysis 4524 24716
4 DATAMINE – Data Mining and Knowledge Discovery 584 19727
5 VLDB – The Vldb Journal 631 17785
6 Journal of Knowledge Management 747 9601
7 Sigkdd Explorations 491 9564
8 Journal of Classification 550 8041
9 KAIS – Knowledge and Information Systems 741 7639
10 WWW – World Wide Web 540 7182
11 INFFUS – Information Fusion 567 5617
12 IDA – Intelligent Data Analysis 477 4167
13 Transactions on Rough Sets 221 1653
14 JECR – Journal of Electronic Commerce Research 122 1577
15 TKDD – ACM Transactions on Knowledge Discovery From Data 110 716
16 IJDWM – International Journal of Data Warehousing and Mining 102 366
17 IJDMB – International Journal of Data Mining and Bioinformatics 132 256
18 IJBIDM – International Journal of Business Intelligence and Data Mining 124 251
19 Statistical Analysis and Data Mining 124 169
20 IJICT – International Journal of Information and Communication Technology 111 125

And here are the top 20 journals by Field Rating.

Rank Journal Publications Field Rating
1 TKDE – IEEE Transactions on Knowledge and Data Engineering 2742 109
 2 IPL – Information Processing Letters 7044 80
3 VLDB – The Vldb Journal 631 61
4 DATAMINE – Data Mining and Knowledge Discovery 584 57
5 Sigkdd Explorations 491 50
6 CS&DA – Computational Statistics & Data Analysis 4524 49
7 Journal of Knowledge Management 747 46
8 WWW – World Wide Web 540 42
9 Journal of Classification 550 37
10 INFFUS – Information Fusion 567 36
11 KAIS – Knowledge and Information Systems 741 33
12 IDA – Intelligent Data Analysis 477 28
13 JECR – Journal of Electronic Commerce Research 122 21
14 Transactions on Rough Sets 221 20
15 TKDD – ACM Transactions on Knowledge Discovery From Data 110 13
16 IJDMB – International Journal of Data Mining and Bioinformatics 132 8
17 IJDWM – International Journal of Data Warehousing and Mining 102 8
18 IJBIDM – International Journal of Business Intelligence and Data Mining 124 7
19 Statistical Analysis and Data Mining 124 7
20 IJICT – International Journal of Information and Communication Technology 111 5

Some observations:

  • The rankings by citations and by field rating are quite similar.
  • The TKDE journal is again at the top of the ranking, just like in the Google ranking.
  • It makes sense that the VLDB journal is quite high. This journal was probably not in the Google ranking because it is more of a database journal than a data mining journal.
  • SIGKDD Explorations is also a good journal, and it makes sense for it to be in the list. However, I am not sure that it should be ranked higher than TKDD and DMKD / DAMI.
  • The KAIS journal is still ranked quite high. This time it is lower than DMKD / DAMI (unlike in the Google ranking) but still higher than TKDD. This is quite strange, since TKDD is arguably a better journal. As explained in the comment section of this blog post, a reason why KAIS is ranked so high may be that, in the past, the journal encouraged authors to cite papers from KAIS. Besides, it appears that the Microsoft ranking has no restriction on time (it does not consider only the last five years, for example).
  • It is also quite strange that “Intelligent Data Analysis” is ranked higher than TKDD.
  • Some journals, like WWW and JECR, should perhaps not be in this ranking. Although they publish data mining papers, they do not focus exclusively on data mining, which is probably why they are not in the Google ranking. Overall, the Microsoft ranking seems to be broader than the Google ranking.

4) Impact factor ranking

Now, another popular way of ranking journals is by their impact factor (IF). I have taken some of the top data mining journals from above and obtained their Impact Factor for 2014/2015, or for 2013 when I could not find the information for 2015. Here is the result:

Journal Impact factor
DMKD/DAMI Data Mining and Knowledge Discovery 2.714
IEEE Transactions on Knowledge and Data Engineering 2.476
Knowledge and Information Systems 1.78
VLDB – The Vldb Journal 1.57
TKDD – ACM Transactions on Knowledge Discovery From Data 1.14
Advances in Data Analysis and Classification 1.03
Intelligent Data Analysis 0.50

Some observations:

  • TKDE and DAMI/DMKD are still among the top journals.
  • As in the Microsoft ranking, DAMI/DMKD is above KAIS, which is above TKDD.
  • As pointed out in the comment section of this blog post, it is strange that KAIS is ranked so high compared, for example, to TKDD or VLDB, which is a first-tier database journal. This shows that the IF is not a perfect metric.
  • Compared to the Microsoft ranking, the IF ranking at least places the “Intelligent Data Analysis” journal much lower than TKDD. This makes sense, as TKDD is a better journal.

Conclusion

In this blog post, we have looked at four different rankings of data mining journals and conferences: the Google ranking, the Microsoft rankings of conferences and of journals, and the Impact Factor ranking.

None of these rankings is perfect. They are somewhat accurate, but they may not always correspond to actual reputations in the data mining field. The Google ranking is more focused on the data mining field, while the Microsoft ranking is perhaps too broad and seems to have no restriction on time. Also, as can be seen from these rankings, different measures yield different results. However, there are still some clear trends, such as TKDE being ranked as one of the top journals and KDD as the top conference in all rankings. The top journals and conferences are more or less the same in each ranking. But there are also some strange ranks, such as KAIS and Intelligent Data Analysis being ranked higher than TKDD in the Microsoft ranking.

Do you agree with these rankings?  Please leave your comments below!

Update 2016-07-19:  I have updated the blog post based on the insightful comments made by Jefrey in the comment section. Thanks!


Posted in Big data, Data Mining, Data science, Research | 3 Comments

An introduction to periodic pattern mining

In this blog post, I will give an introduction to the discovery of periodic patterns in data. Mining periodic patterns is an important data mining task, as patterns may periodically appear in all kinds of data, and it may be desirable to find them to understand the data and support strategic decisions.

For example, consider customer transactions made in retail stores. Analyzing the behavior of customers may reveal that some customers have periodic behaviors, such as buying certain products, like wine and cheese, every weekend. Discovering these patterns may be useful for promoting products on weekends, or for making other marketing decisions.

Another application of periodic pattern mining is stock market analysis. One may analyze the price fluctuations of stocks to uncover periodic changes in the market. For example, the stock price of a company may follow some patterns every month before it pays its dividend to its shareholders, or before the end of the year.

In the following paragraphs, I will first present the problem of discovering frequent periodic patterns, that is, periodic patterns that appear frequently. Then, I will discuss an extension of the problem called periodic high-utility pattern mining, which aims at discovering profitable patterns that are periodic. Moreover, I will present open-source implementations of these algorithms.

The problem of mining frequent periodic patterns

The problem of discovering periodic patterns can be generally defined as follows. Consider a database of transactions depicted below. Each transaction is a set of items (symbols), and transactions are ordered by their time.

A transaction database

Here, the database contains seven transactions, labeled T1 to T7. This database format can be used to represent all kinds of data. However, for our example, assume that it is a database of customer transactions in a retail store. The first transaction indicates that a customer has bought the items a and c together. For example, a could mean an apple, and c could mean cereals.

Given such a database of transactions, it is possible to find periodic patterns. The concept of periodic patterns is based on the notion of a period.

A period is the time elapsed between two occurrences of a pattern. It can be counted in terms of time, or in terms of a number of transactions. In the following, we will count the length of periods in terms of the number of transactions. For example, consider the itemset (set of items) {a,c}. This itemset has five periods, illustrated below. The number annotating each period is the period length, calculated as a number of transactions.

The five periods of the itemset {a,c}

The first period of {a,c} is what appeared before the first occurrence of {a,c}. By definition, if {a,c} appears in the first transaction of the database, it is assumed that this period has a length of 1.

The second period of {a,c} is the gap between the first and second occurrences of {a,c}. The first occurrence is in transaction T1 and the second occurrence is in transaction T3. Thus, the length of this period is said to be 2 transactions.

The third period of {a,c} is the gap between the second and third occurrences of {a,c}. The second occurrence is in transaction T3 and the third occurrence is in transaction T5. Thus, the length of this period is said to be 2 transactions.

The fourth period of {a,c} is the gap between the third and fourth occurrences of {a,c}. The third occurrence is in transaction T5 and the fourth occurrence is in transaction T6. Thus, the length of this period is 1 transaction.

Now, the fifth period is interesting. It is defined as the time elapsed between the last occurrence of {a,c} (in T6) and the last transaction in the database, which is T7. Thus, the length of this period is also 1 transaction.

Thus, in this example, the list of period lengths of the pattern {a,c} is: 1,2,2,1,1.
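The computation of period lengths described above can be sketched in a few lines of Java (my own illustration of the definition, not code from SPMF):

```java
import java.util.ArrayList;
import java.util.List;

public class PeriodExample {
    // Compute the period lengths of a pattern, given the (1-based)
    // indices of the transactions where it occurs and the database size.
    // A sentinel position 0 marks the start of the database, and the
    // last period runs until the last transaction of the database.
    static List<Integer> periods(int[] occurrences, int databaseSize) {
        List<Integer> result = new ArrayList<>();
        int previous = 0;
        for (int position : occurrences) {
            result.add(position - previous);
            previous = position;
        }
        result.add(databaseSize - previous);
        return result;
    }

    public static void main(String[] args) {
        // {a,c} occurs in transactions T1, T3, T5 and T6 of a
        // database containing seven transactions (T1 to T7)
        System.out.println(periods(new int[]{1, 3, 5, 6}, 7)); // [1, 2, 2, 1, 1]
    }
}
```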

Several algorithms have been designed to discover periodic patterns in such databases, such as the PFP-Tree, MKTPP, ITL-Tree, PF-tree, and MaxCPF algorithms. For these algorithms, a pattern is said to be periodic if:
(1) it appears in at least minsup transactions, where minsup is a number of transactions set by the user,
(2) and the pattern has no period of length greater than a maxPer parameter also set by the user.

Thus, according to this definition, if we consider minsup = 3 and maxPer = 2, the itemset {a,c} is said to be periodic, because it has no period of length greater than 2 transactions and it appears in at least 3 transactions.

This definition of a periodic pattern is, however, too strict. I will explain why with an example. Assume that maxPer is set to 3 transactions. Now, consider a pattern that appears every two transactions many times, but that once appears only after 4 transactions. Because the pattern has a single period greater than maxPer, it would automatically be deemed non-periodic, although it is in general periodic. Thus, this definition is too strict.

Thus, in a recent paper, we proposed a solution to this issue. We introduced two new measures, called the average periodicity and the minimum periodicity. The idea is that we should not discard a pattern because it has a single period that is too large, but should instead look at how periodic the pattern is on average. The designed algorithm is called PFPM. It discovers periodic patterns using a more flexible definition of what a periodic pattern is. A pattern is said to be a periodic pattern if:
(1) the average length of its periods denoted as avgper(X) is not less than a parameter minAvg and not greater than a parameter maxAvg.
(2) the pattern has no period greater than a maximum maxPer.
(3) the pattern has no period smaller than a minimum minPer.

This definition of a periodic pattern is more flexible than the previous one, and thus lets the user better select the periodic patterns to be found. The user can set the minimum and maximum average periodicity as the main constraints for finding periodic patterns, and use the minimum and maximum periodicity as loose constraints to filter out patterns whose periods vary too widely.
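Checking this flexible definition on a pattern amounts to computing the minimum, maximum, and average of its period lengths. Here is a small Java sketch of the idea (an illustration of the definition, not the actual PFPM implementation):

```java
import java.util.Arrays;
import java.util.List;

public class PeriodicityCheck {
    // Check the flexible periodicity constraints on a list of period
    // lengths: bounds on the smallest and largest period, plus bounds
    // on the average period length.
    static boolean isPeriodic(List<Integer> periods, int minPer, int maxPer,
                              double minAvg, double maxAvg) {
        int sum = 0;
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int p : periods) {
            sum += p;
            min = Math.min(min, p);
            max = Math.max(max, p);
        }
        double avg = (double) sum / periods.size();
        return min >= minPer && max <= maxPer && avg >= minAvg && avg <= maxAvg;
    }

    public static void main(String[] args) {
        // Period lengths of {a,c} from the running example: average is 1.4
        List<Integer> periodsAC = Arrays.asList(1, 2, 2, 1, 1);
        System.out.println(isPeriodic(periodsAC, 1, 3, 1, 2)); // true
    }
}
```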

Based on the above definition of periodic patterns, the problem of mining all periodic patterns in a database is to find all periodic patterns that satisfy the constraints set by the user. For example, if minPer = 1, maxPer = 3, minAvg = 1, and maxAvg = 2, the 11 periodic patterns found in the example database are the ones shown in the table below. This table indicates the support (frequency) and the minimum, average, and maximum periodicity of each pattern found:

Pattern    Support   Min. per.   Avg. per.   Max. per.
{b}           3          1         1.75         3
{b,e}         3          1         1.75         3
{b,c,e}       3          1         1.75         3
{b,c}         3          1         1.75         3
{d}           3          1         1.75         3
{c,d}         3          1         1.75         3
{a}           4          1         1.4          2
{a,c}         4          1         1.4          2
{e}           5          1         1.17         2
{c,e}         4          1         1.4          3
{c}           6          1         1.0          2

As can be observed in this example, the average periodicity gives a better view of how periodic a pattern is. For example, consider the patterns {a,c} and {e}. Both patterns have a largest period of 2 (called the maximum periodicity), and would thus be considered equally periodic using the standard definition of a periodic pattern. But their average periodicities are quite different: on average, {a,c} appears with a period of 1.4 transactions, while {e} appears with a period of 1.17 transactions.
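As a side note, because the periods of a pattern always sum to the number of transactions |D|, and a pattern with support sup(X) has sup(X) + 1 periods, the average periodicity can be computed directly as avgper(X) = |D| / (sup(X) + 1). A quick check of the two values above (illustrative code, not part of SPMF):

```java
public class AvgPeriodicityCheck {
    public static void main(String[] args) {
        int dbSize = 7; // number of transactions in the example database

        // avgper(X) = |D| / (sup(X) + 1)
        System.out.println((double) dbSize / (4 + 1)); // {a,c}, support 4
        System.out.println((double) dbSize / (5 + 1)); // {e}, support 5
    }
}
```

This prints 1.4 and 1.1666666666666667, matching the values in the table.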

Discovering periodic patterns using the SPMF open-source data mining library

An implementation of the PFPM algorithm for discovering periodic patterns, and its source code, can be found in the SPMF data mining software. This software allows running the algorithm from the command line or a graphical interface. Moreover, the software can be used as a library, and the source code is provided under the GPL 3 license. For the example discussed in this blog post, the input database is a text file encoded as follows:

3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

where the numbers 1, 2, 3, 4, 5 represent the letters a, b, c, d, e. To run the algorithm from the source code, the following lines of code can be used:

// Input and output file paths
String input = "contextPFPM.txt";
String output = "output.txt";

// Parameters of the algorithm
int minPeriodicity = 1;
int maxPeriodicity = 3;
int minAveragePeriodicity = 1;
int maxAveragePeriodicity = 2;

// Applying the algorithm
AlgoPFPM algorithm = new AlgoPFPM();
algorithm.runAlgorithm(input, output, minPeriodicity, maxPeriodicity,
        minAveragePeriodicity, maxAveragePeriodicity);

The output is then a file containing all the periodic patterns shown in the table.

2 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
4 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
4 3 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
1 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
1 3 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
5 #SUP: 5 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.1666666666666667
5 3 #SUP: 4 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.4
3 #SUP: 6 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.0

For example, the eighth line represents the pattern {a,c}. It indicates that this pattern appears in four transactions, that its smallest and largest periods are respectively 1 and 2 transactions, and that its average periodicity is 1.4 transactions.
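If the results need to be post-processed programmatically, each output line can be parsed with plain string handling. For example (illustrative code, not part of SPMF):

```java
public class ParseOutputLine {
    public static void main(String[] args) {
        // One line of the output file: the pattern's items, followed by
        // its support and periodicity measures
        String line = "1 3 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4";

        String[] parts = line.split("#");
        String items  = parts[0].trim();  // the items, e.g. "1 3" = {a,c}
        int support   = Integer.parseInt(parts[1].replace("SUP:", "").trim());
        int minPer    = Integer.parseInt(parts[2].replace("MINPER:", "").trim());
        int maxPer    = Integer.parseInt(parts[3].replace("MAXPER:", "").trim());
        double avgPer = Double.parseDouble(parts[4].replace("AVGPER:", "").trim());

        System.out.println(items + " appears in " + support
                + " transactions, periods in [" + minPer + ", " + maxPer
                + "], average " + avgPer);
    }
}
```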

Related problems

Note that there also exist extensions of the problem of discovering periodic patterns. For example, another algorithm offered in the SPMF library, called PHM, is designed to discover “periodic high-utility itemsets” in customer transaction databases. The goal is not only to find patterns that appear periodically, but also to discover patterns that yield a high profit in terms of sales.

Conclusion

In this blog post, I have introduced the problem of discovering periodic patterns in databases. I have also explained how to use open-source software to discover periodic patterns. Mining periodic patterns is a general problem that may have many applications.

There are also many research opportunities related to periodic patterns, as this subject has not been extensively studied. For example, one possibility would be to transform the proposed algorithm into an incremental or fuzzy algorithm, or to discover more complex types of periodic patterns such as periodic sequential rules.

Also, in this blog post, I have not discussed time series (sequences of numbers). Discovering periodic patterns in time series is another interesting problem, which requires different types of algorithms.

==
Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my Twitter account @philfv to get notified about new posts.


Full-time faculty positions at Harbin Institute of Technology (data mining, statistics, psychology, design…)

Full-time faculty positions at Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China

hit1

The Center of Innovative Industrial Design is currently looking to hire full-time faculty members at the rank of assistant professor, associate professor or professor, with expertise in one of the four following research directions:

RESEARCH DIRECTIONS

1: Industrial Product Innovation and User Perception Detection (HITSZ16-1601001)

Applicants should have a degree in computer science (e.g., artificial intelligence, data mining, machine learning), psychology (e.g., cognitive psychology, engineering psychology, quantitative research or psychological measurement), statistics, or another major related to quantitative research. The applicant should have a doctorate from a well-known university. If the applicant holds a doctorate from a domestic university, postdoctoral work experience overseas is preferred. Applicants with the following professional or interdisciplinary backgrounds are preferred: physiological signal sensing and network construction, physiological and psychological parameter testing and modeling, human-computer interaction and virtual reality, data visualization design, natural language processing, cognitive science and neurology, etc. Candidates should have excellent scientific research capabilities and be fluent in English for teaching and communication.

2: Industrial Product Design of Human-Computer Interfaces (HITSZ16-1601002)

Applicants should have a doctorate in mechanical or industrial design, computer science, ergonomics, digital media design, or a related field. The degree should have been obtained from a well-known university at home or abroad. If the doctorate was obtained from a domestic university, postdoctoral work experience overseas is preferred. The candidate should have research experience in human-computer interface design, including virtual or physical product interface design. The candidate should be proficient in computer programming or in using digital product design software, and should have research experience in user research, interface research, and product-related sociological research. The candidate should be familiar with the academic frontiers of the human-computer interaction area, have excellent scientific research abilities, preferably have experience in interdisciplinary projects, and be fluent in English for teaching and communication.

3: Industrial Environmental Design and Digital Research (HITSZ16-1601003)

The candidate should hold a doctorate in design from a well-known university. If the doctorate is from a domestic university, postdoctoral work experience abroad is preferred. The applicant should have interests in urban and industrial space design; architectural and spatial environment design; interior design for transport and equipment operation; and furniture, lighting, furnishing and display design. The applicant should have solid scientific research capabilities, be able to work in an interdisciplinary research environment, and be fluent in English for teaching and communication.

4: Industrial Product Modeling and Visual and Auditory Communication Research (HITSZ16-1601004)

The candidate should hold a doctorate in design, or a related doctorate in fine arts, from a well-known university at home or abroad. If the doctorate is from a domestic university, postdoctoral work experience abroad is preferred. The candidate should have professional interests in audio-visual communication design and practice, industrial product design, web design, digital media interface design, etc. The candidate should have excellent scientific research capabilities, be able to work in an interdisciplinary environment, be fluent in English for teaching and communication, and have strong creative design capabilities.

JOB RESPONSIBILITIES

1. Teaching: The selected candidate will participate in the development of the discipline. S/he will develop and teach graduate and undergraduate courses, and supervise graduate students.

2. Research: The selected candidate should have a clear research direction and be an active researcher, managing national, provincial, municipal or business research projects. He/she should have publications such as monographs, textbooks or papers in top journals as first or corresponding author.

3. Discipline construction: Work actively on discipline construction, and chair or participate in the discipline or teaching team.

4. Experiment: The candidate should actively participate in the development of the center.

5. Other: It is expected that the selected candidate will carry out international cooperation and academic exchange activities, and chair or participate in international academic conferences, to expand the visibility and influence of the center and school. The selected candidate is also expected to actively participate in community service work and to perform other tasks stipulated in his/her contract.

HOW TO APPLY

To apply for the position(s), the candidate should submit the following application materials:

1. Application Letter
a) self-evaluation of professional quality, ability, and character
b) personal statement about research and teaching ability
c) work plan for this position
2. The Teaching and Research Job Application Form
3. CV in English and Chinese (if possible)

IMPORTANT NOTE: Please email the contacts below with application materials as attachments and specify the applied position ID + your name.

Human resource contact person:

Ms. Wang, e-mail: hr@hitsz.edu.cn.

Innovative Industrial Design Research Center Contact:
Dr. Yao, e-mail: julie_j_yao@163.com
