How does journal paper similarity checking work? (CrossCheck)

In this blog post, I will talk about the recent trend of journal editors rejecting papers because of their similarity with other papers, as detected by the CrossCheck system. I will explain how this system works and discuss its impact, benefits and drawbacks.

similarity checking

What is similarity checking?

Nowadays, when an author submits a paper to a well-known academic journal (from publishers such as Springer, Elsevier and IEEE), the editor will first submit the paper to an online system to check if the paper contains plagiarism. That system compares the paper with documents from a database created by various publishers and websites to check if the paper is similar to some existing documents. A report is then provided to the journal editor indicating whether there is some similarity with existing documents. In the case where the similarity is high or some key parts have clearly been plagiarized from other authors, the editor will typically reject the paper. Otherwise, the editor will send the paper to reviewers and start the normal review process.

Why check the similarity with other papers?

There are two reasons why editors perform this similarity check:

  • to quickly detect plagiarized papers that should clearly not be published.
  • to check if a paper from an author is original (i.e. if it is not too similar to previous papers from the same author).

In the second case, some journal editors will say, for example, that the “similarity score” should be below 20% or 40%, depending on the journal. Thus, under this model, an author is allowed to reuse just a little bit of text from his own papers.

How does it work?

Now you perhaps wonder how that similarity score is calculated. Having access to some similarity reports generated by the CrossCheck system, I will describe what these reports look like and then explain some key aspects of this system.

After the editor submits a paper to CrossCheck, he receives a report. This report contains a summary page that looks like this:

Part of a CrossCheck similarity report

This report gives an overall similarity score of 32%. It can be interpreted as meaning that, overall, 32% of the content of the text matches existing documents. It is furthermore said that 4% is a match with internet sources, 31% with some other publications and 2% with student papers. And as can be observed, 31% + 2% + 4% does not add up to 32%. Why? Actually, the calculation of the similarity score is misleading. Although I do not have access to the formula or the source code of the system, I found some explanation online: the similarity score is computed by matching each part of the text with at most one document. In other words, if some paragraph of a submitted paper matches two existing documents, this paragraph will be counted only once in the overall score of 32%.
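
To make this counting rule concrete, here is a small Python sketch of a scoring scheme in which each text segment is credited to at most one source in the overall score, while the per-source percentages count every match. This is only an illustration of the explanation above, with made-up segment and source names; the actual CrossCheck formula is proprietary and not public.

```python
# Hypothetical illustration of the counting rule described above: per-source
# percentages count every match, but the overall score counts each part of
# the paper at most once, so the numbers need not add up. This is NOT the
# actual CrossCheck/iThenticate formula, which is proprietary.

def similarity_scores(paper_segments, matches):
    """paper_segments: list of (segment_id, word_count) pairs.
    matches: dict mapping a source name to the set of segment_ids it matches."""
    word_count = dict(paper_segments)
    total_words = sum(word_count.values())

    # Per-source score: every segment matching that source is counted.
    per_source = {
        source: sum(word_count[s] for s in segs) / total_words
        for source, segs in matches.items()
    }

    # Overall score: a segment matched by several sources counts only once.
    matched = set().union(*matches.values()) if matches else set()
    overall = sum(word_count[s] for s in matched) / total_words
    return overall, per_source

# Toy paper with four 25-word segments; segment "s1" matches two sources.
segments = [("s1", 25), ("s2", 25), ("s3", 25), ("s4", 25)]
matches = {"internet": {"s1"}, "publications": {"s1", "s2"}}
overall, per_source = similarity_scores(segments, matches)
print(overall)                   # 0.5: s1 and s2, each counted once
print(sum(per_source.values()))  # 0.75: s1 is counted for both sources
```

With this kind of scheme, the sum of the per-source percentages can exceed the overall score whenever a part of the paper matches several sources, which would explain the 4% + 31% + 2% vs. 32% discrepancy in the report.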

An annotated PDF is also provided to the editor, highlighting the parts that match existing documents. For example, I show some pages of such a report below, where I have blurred the text for anonymization:

Detailed similarity comparison (blurred)

Detailed similarity comparison (blurred)

In such a report, matching parts are highlighted in different colors, and numbers indicate which document has been matched to which part of the text.

Limitations of this similarity checking system

I will now describe some problems that I have observed in the reports made by this similarity checking software:

  • In the above report, the countries and affiliations of the authors are considered as matching their previous documents, which increases the similarity score. But obviously, this should not be taken into account. Can we blame an author for using the same affiliation in two papers?
  • Keywords are also considered as matching previous documents. But I don’t think that using some of the same keywords as another paper should be an issue.
  • Some of the matches are very generic expressions or sentences used in many papers, such as “explained in the next section” or “this paper is organized as follows”.
  • Another limitation is that this similarity check completely ignores figures and illustrations. Thus, if an author extends a conference paper into a journal paper and adds many figures for experiments to further differentiate the two papers, these figures will be completely ignored when calculating the similarity score.
  • Actually, the similarity checking system is limited to the text content of the paper. It can check the main text and the text in tables, algorithms, math formulas, biographies and affiliations. But it cannot check the text in figures that are included as bitmaps (pictures) in a paper. For example, if one includes an algorithm in a paper as a bitmap instead of as text, the system will ignore that content. The system will only be able to compare the labels of the figures, not their content. Thus, an author with malicious intent could easily hide content from the matching system by transforming some content of an article into a bitmap.
  • In the report that I have analyzed, I found that the bibliography is also considered when computing the similarity score. Obviously, this seems quite unfair. Citing the same references as some other papers (especially when they are from the same author) is not plagiarism. In the case of the report that I have read, about 90% of the references were considered as matching those of several other documents, which probably increased the similarity score by at least 10%. But I have noticed that the editor can deactivate this function.
  • I have also observed that the system can match the biography of the authors at the end of the paper and the acknowledgements with those of their previous papers. This is also a problem. It is clearly not plagiarism to reuse the same biography or acknowledgements in two papers. But in that system, it increases the similarity score.
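
To illustrate why generic academic phrases get flagged, here is a hypothetical word n-gram matcher (the two example sentences are made up). Real similarity checkers are more sophisticated, but any matcher based on shared word sequences will report boilerplate such as “this paper is organized as follows” as a match between independently written papers.

```python
# A hypothetical word n-gram matcher, to illustrate why generic academic
# phrases inflate similarity scores. Real checkers are more sophisticated,
# but any matcher based on shared word sequences will flag such boilerplate.

def shared_ngrams(text_a, text_b, n=5):
    """Return the word n-grams (as tuples) that appear in both texts."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(text_a) & ngrams(text_b)

submitted = "the rest of this paper is organized as follows section two reviews related work"
existing = "this paper is organized as follows section two presents the background"

# The two sentences were written independently, yet they share several
# 5-grams, all coming from the boilerplate run
# "this paper is organized as follows section two".
for gram in sorted(shared_ngrams(submitted, existing)):
    print(" ".join(gram))
```

A checker could maintain a list of such very common academic phrases and exclude them before computing the score, which would address this particular limitation.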

Thus, my opinion is that this system is quite imperfect. And in fact, it is not claimed to be a perfect system.

What is the impact of this system?

The major impact is that many plagiarized papers can be detected early, which is a good thing, as detecting these papers can save editors and reviewers a lot of time.

However, a drawback of this system is that these metrics are clearly imperfect, and there is a real danger that some editors just check the similarity score to make a decision on a paper without reading the report carefully. For example, I have heard that some journals simply apply an arbitrary threshold, such as rejecting all papers with a score >= 30%. In my opinion, this is a problem if the threshold is too low, because in some cases it is justified for an author to reuse text from his own previous papers. For example, an author may want to reuse some basic problem definitions from his own paper in a second paper with different contributions. Or an author may want to extend a conference paper into a journal paper with some new contributions. In such cases, I think that accepting some overlap between papers is reasonable.

A few years ago, when such systems were not in use, it was quite common for some authors to extend a conference paper into a journal paper by adding 50% more content. Today, with this system, this may not be allowed anymore, perhaps forcing authors to avoid publishing early results in conference papers (or otherwise to spend extra time rewriting their paper in a different way to extend it into a journal paper).

Another aspect is that such a system needs a database of all papers. But should the authors have to agree before their papers are put in this database? Probably not, because when a paper is published, the authors typically have to transfer the copyright to the publisher. Thus, I guess that the publisher is free to share the paper with such a similarity checking service. But still, it raises some questions. To make a comparison, there exists a homework plagiarism checking system called TurnItIn. This system has actually been legally challenged in the US and Canada, where some students have won court battles so that their homework is not submitted to or included in the system. Although it is a slightly different situation, we could imagine that some people may also want to challenge journal similarity checking systems.

How to get a similarity checking report for your paper?

Checking the similarity of a paper is not free. However, editors or associate editors of journals have a subscription to use the similarity checking service. Thus, if you know an editor or associate editor who has a subscription, he may perhaps be able to generate a report for your paper for free. Otherwise, one can pay to obtain the service.


In this blog post, I provided an overview of the similarity checking system called CrossCheck, used by several publishers and journals. I also talked about how scores appear to be computed, some limitations of this system, and its impact in the academic world. I hope this has been interesting. Please share your comments in the comment section below.

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Skills needed for a data scientist? (comments on the HBR article)

Recently, I read an article on the Harvard Business Review (HBR) website about data science skills for businesses. This article proposes to categorize skills related to data on a 2×2 matrix where skills are labelled as useful vs. not useful, and time-consuming vs. not time-consuming. The author of that article has drawn such a 2×2 matrix illustrating the needs of his team (see below).

Obtained from Harvard Business Review

This matrix has received many negative comments online in the last few days. These comments have mainly highlighted two problems:

  • Why are mathematics and statistics viewed as useless?
  • Data science is viewed as useful but mathematics and statistics are viewed as useless, which is strange since math and stats are part of data science.

Having said that, I also don’t like this chart. And many people have asked why it was published in the Harvard Business Review (a good magazine). But we should keep in mind that this chart illustrates the needs of one company. Thus, it does not claim that mathematics and statistics are useless for everyone. It is quite possible that this company does not see any benefit in taking mathematics and statistics courses or training. Following the negative comments, the author and editor at HBR have reworded some parts of the article to try to make it clearer that this should be interpreted as a case study.

A part of the problem related to this chart and article is that the term “data science” has always been very ambiguous. Some people with very different backgrounds and doing very different things call themselves data scientists. This is a reason why I usually don’t use this term. And it could be a part of the reason why this chart shows a distinction between data science, math and stats, which I would describe as overlapping.

From a more abstract perspective, this article highlights that some companies are not interested in investing in skills that take too much time to acquire (and have no short-term benefits). For example, I know that some companies prefer to use code from open-source projects or ready-made tools to analyze data rather than spending time developing customized tools to solve problems. This is understandable, as the goal of companies is to earn money, and there are many tools available for data analysis. However, one should not forget that using these tools often requires an appropriate background in mathematics, statistics or computer science to choose an appropriate model given its assumptions and to correctly interpret the results. Thus, having those skills that take more time to acquire is also important.

What is your opinion about this chart and the most important skills for a data scientist? Please share your opinion in the comment section below.

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

(video) Minimal High Utility Itemset Mining with MinFHM

This is a video presentation of the paper “Mining Minimal High Utility Itemsets” about high utility itemset mining using MinFHM. It is the first video of a series of videos that will explain various data mining algorithms.

(link to download the video if the player does not work)

More information about the MinFHM algorithm is provided in this research paper:

Fournier-Viger, P., Lin, C.W., Wu, C.-W., Tseng, V. S., Faghihi, U. (2016). Mining Minimal High-Utility Itemsets. Proc. 27th International Conference on Database and Expert Systems Applications (DEXA 2016). Springer, LNCS, 13 pages, to appear

The source code and datasets of the MinFHM algorithm can be downloaded here:

The source code of MinFHM and datasets are available in the SPMF software.

I will post videos like this perhaps once every few weeks. I actually have a lot of PPTs explaining various algorithms on my computer, but I just need to find time to record the videos. In a future blog post, I will also explain which software and equipment can be used to record such videos. This is the first video, so obviously it is not perfect. I will make some improvements in the following videos. If you have any comments, please post them in the comment section!

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Expensive Academic Conferences – the case of ICDM

I was recently thinking of attending IEEE ICDM 2018 (International Conference on Data Mining) in Singapore next month. It is a top 5 data mining conference. According to my schedule, I could attend it for 2 days, and since Singapore is close to China, it is convenient to go there. However, I was quite surprised by how expensive the registration fee of this conference has become. As of today, the “standard registration fee (by 28 October)” is roughly 1360 USD or 9300 CNY.

icdm registration fee 2018

Registration fees  from ICDM2018 website

This is actually the most expensive conference that I have ever considered attending. Most conferences that I have attended have been in the 300-700 USD range, half the price of ICDM or less. But is it an outlier? To see more clearly, I decided to compare the standard registration fee of ICDM 2018 with those of previous editions of ICDM:

  • ICDM 2018: 1360 USD (11% increase from 2017)
  • ICDM 2017: 1220 USD (12% increase from 2015)
  • ICDM 2015: 1080 USD (28% increase from 2013)
  • ICDM 2013: 844 USD (68% increase from 2011)
  • ICDM 2011: about 500 USD

This is quite interesting. It shows a steady increase in the registration price of the ICDM conference over the years. The registration fee has increased so much that the price is now about 2.7 times what it was 8 years ago!
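
As a quick check of the arithmetic, the increases and the overall ratio can be recomputed from the fees listed above (small differences from the rounded percentages in the list are due to rounding, and the 2011 fee is approximate):

```python
# Recomputing the fee increases from the registration fees listed above (USD).
fees = {2011: 500, 2013: 844, 2015: 1080, 2017: 1220, 2018: 1360}

years = sorted(fees)
for prev, curr in zip(years, years[1:]):
    increase = 100 * (fees[curr] - fees[prev]) / fees[prev]
    print(f"{prev} -> {curr}: +{increase:.0f}%")

# Overall ratio: the 2018 fee divided by the 2011 fee.
print(round(fees[2018] / fees[2011], 2))  # 2.72, i.e. about 2.7 times the 2011 fee
```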

Why is it so expensive?

One could argue that the reason is the location of the conference. But the increase has been steady over the years, no matter where the conference was organized. Moreover, such big conferences often have thousands of attendees and usually many sponsors. I recently attended the KDD 2018 conference, which was also expensive, but less so than ICDM. There were more than 3000 attendees, and if I remember correctly, they received more than 1 million dollars in sponsorship.

Thus, where does all this money go? A good part goes to renting a convention center, publishing the proceedings, and other aspects such as providing scholarships to students. But many conferences also make a considerable profit. Some conferences are not for profit, while some other conferences will pay the local organizers or the association organizing the conference. I am not sure how the money is used in the case of ICDM or IEEE and what they do with the profits, as I could not find the information. But I believe that such big conferences can generate a huge amount of money. From discussing with organizers of smaller conferences (200 attendees) that have much lower registration fees and less sponsorship, I know that some conferences can still make a 20,000 USD profit.

As for IEEE, ICDM is not their only conference in the 1000 USD range. Some other flagship conferences like IEEE ICC (about communications) also have fees greater than 1000 USD. In the field of data mining, the KDD conference is also quite expensive, although currently less so than ICDM. In some ways, many people want to attend these conferences, so they are willing to pay these high fees.

Consequences of high registration fees

The consequence of such high registration fees is that some people may not have enough money to attend, and that a lot of money is spent by researchers. And in many cases, that money comes from research projects funded by the government. Thus, one could argue that this money could be used in better ways.

Personally, I was thinking of attending ICDM, but when I saw that I would have to pay almost 1400 USD for two days of access to the conference, I decided that it was not reasonable to spend that much money. I have enough research funding to pay for this, but I still do not want to waste the money provided by the government to support research. Thus, this year, I will use the money for other things rather than going to ICDM.

Update 2019-03-14: One of the general co-chairs of ICDM 2018 has taken the time to provide his insights and give some explanations about the registration fees of ICDM 2018 in the comment section. You can read the comment. It says that, basically, the increase in price would be partially explained by fluctuations of the exchange rate and the 7% sales tax of Singapore.

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Periodic patterns in Web log time series

Recently, I have analysed trends about visitors to this blog. I have made two observations. First, there are about 500 to 1000 visitors per day. For this, I want to thank you all for reading and commenting on the blog. Second, if we look carefully at the number of visitors per day, it becomes a time series, and we can clearly see a pattern that repeats itself every week. Below is a picture of this time series for January 2018.

periodic visitor accesses

As you can see, there is a clear pattern every week. Toward the beginning of the week, on Monday and Tuesday, the number of visitors increases, while around Friday it starts to decrease. Finally, on Saturday and Sunday, there is a considerable decrease, and then it increases again on Monday. This pattern repeats itself every week. We can see it visually, but such patterns could also be detected using time series analysis techniques such as an autocorrelation plot. Besides, it would be easy to predict this time series using time series forecasting models.
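
As a sketch of how the autocorrelation technique would detect this weekly cycle, here is a small Python example. The visit counts are synthetic (the numbers in weekly_shape are made up to mimic the pattern described above; the real log data is not reproduced here):

```python
# A minimal sketch of detecting a weekly cycle with autocorrelation.
# The visit counts below are synthetic (made up to mimic the pattern
# described in the post); the real log data is not reproduced here.
import numpy as np

rng = np.random.default_rng(42)
weekly_shape = np.array([900, 950, 900, 850, 750, 550, 500], dtype=float)  # Mon..Sun
visits = np.tile(weekly_shape, 8) + rng.normal(0, 30, 56)  # 8 weeks of daily counts

def autocorr(series, lag):
    """Sample autocorrelation of a 1-D series at the given lag."""
    s = series - series.mean()
    return np.dot(s[:-lag], s[lag:]) / np.dot(s, s)

# The correlation peaks at lags that are multiples of 7 (one week),
# which is exactly what a weekly pattern looks like on an autocorrelation plot.
for lag in (1, 3, 7, 14):
    print(lag, round(autocorr(visits, lag), 2))
```

The spike at lag 7 (and again at lag 14) is the signature of a weekly periodic pattern, and it is what a forecasting model would exploit to predict future daily counts.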

We can also see a relationship with the concept of periodic patterns that I have previously discussed on this blog. A periodic pattern is a pattern that repeats itself over time. That is all for today. I just wanted to share this interesting finding.

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Upcoming book: High Utility Itemset Mining: Theory, Algorithms and Applications

I am happy to announce that the draft of the book about high utility pattern mining has been finalized and submitted to the publisher (Springer). It should thus be published in the very near future.

high utility pattern mining

The book contains 12 chapters written by several top researchers from the field of pattern mining, for a total of 350 pages. The title is “High Utility Itemset Mining: Theory, Algorithms and Applications”. It discusses high utility itemset mining and other related topics. Here is the table of contents:

Editors: Philippe Fournier-Viger, Jerry Chun-Wei Lin, Bay Vo, Roger Nkambou, Vincent S. Tseng.

  • Chapter 1: A Survey of High Utility Itemset Mining
    Philippe Fournier-Viger, Jerry Chun-Wei Lin, Tin Truong Chi, Roger Nkambou
    This chapter gives a more than 39-page introduction to high utility pattern mining, designed to give a quick overview of the field and the main results.
  • Chapter 2: A Comparative Study of Top-K High Utility Itemset Mining Methods
    Srikumar Krishnamoorthy
    This chapter gives an in-depth discussion of top-k high utility itemset mining, including a very detailed comparison of the state-of-the-art algorithms.
  • Chapter 3: A Survey of High Utility Pattern Mining Algorithms for Big Data
    Morteza Zihayat, Methdi Kargar, Jaroslaw Szlichta
    This chapter reviews algorithms for mining high utility patterns in big data.
  • Chapter 4: A survey of High Utility Sequential Pattern Mining
    Tin Truong Chi, Philippe Fournier-Viger
    This chapter provides a survey of  high utility sequential pattern mining. It contains several new theoretical results and a very detailed comparison of upper-bounds and algorithms.
  • Chapter 5: Efficient Algorithms for High Utility Itemset Mining without Candidate Generation
    Jun-Feng Qu, Mengchi Liu, Philippe Fournier-Viger
    This chapter presents the HUI-Miner algorithm and a novel extension called HUI-Miner*, which improves its performance in many situations.
  • Chapter 6: High Utility Association Rule Mining
    Loan T.T. Nguyen, Thang Mai, Bay Vo
    This chapter discusses another important topic, discovering high utility associations.
  • Chapter 7: Mining High-utility Irregular Itemsets
    Supachai Laoviboon, Komate Amphawan
    This chapter considers the time dimension in high utility itemset mining to find irregular patterns.
  • Chapter 8: A survey of Privacy Preserving Utility Mining
    Duy-Tai Dinh, Van-Nam Huynh, Bac Le, Philippe Fournier-Viger, Ut Huynh, Quang-Minh Nguyen
    This chapter provides an overview of techniques for hiding high utility patterns for privacy purposes.
  • Chapter 9: Extracting Potentially High Profit Product Feature Groups by Using High Utility Pattern Mining and Aspect based Sentiment Analysis
    Seyfullah Demir, Oznur Alkan, Firat Cekinel, Pinar Karagoz
    This chapter presents an interesting application of high utility pattern mining related to sentiment analysis.
  • Chapter 10: Metaheuristics for Frequent and High-Utility Itemset Mining
    Youcef Djenouri, Philippe Fournier-Viger, Asma Belhadi, Jerry Chun-Wei Lin
    This chapter provides a survey of evolutionary and swarm intelligence algorithms for high utility itemset mining.
  • Chapter 11: Mining Compact High Utility Itemsets without Candidate Generation
    Cheng-Wei Wu, Philippe Fournier-Viger, Jia-Yuan Gu, Vincent S. Tseng
    This chapter presents algorithms for mining closed and maximal high utility itemsets. It includes a novel strategy for identifying maximal patterns when using a depth-first search.
  • Chapter 12: Visualization and Visual Analytic Techniques for Patterns
    Wolfgang Jentner and Daniel A. Keim.
    This chapter discusses the problem of visualizing the patterns found.

This will be a very good book with many great contributions, and I am excited that it will be published soon. I will keep you updated on this blog as we get closer to the release.

Update: the book is now published!

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

What I don’t like about academia

In this blog post, I will talk about academia. There are numerous things that I like about academia, and I really enjoy working in it. But for this blog post, I will try to talk about what I don’t like in academia, to give a different perspective.


Even when we like something very much, there are always some things that we don’t like. So, here we go. Here is a list of some things that I more or less dislike in academia:

  • A sometimes excessive pressure to publish: There is sometimes great pressure on researchers to produce many publications in a given time frame, which may come from various sources. It is in part necessary, as it increases productivity and ensures that researchers do not become lazy. But a drawback is that some researchers may be less willing to take risks or may focus on short-term projects rather than on more difficult but more rewarding projects.
  • Conflicts of interest at various levels. A researcher should avoid conflicts of interest. However, not everyone does, and this is a problem. A few years ago, for example, I was a program committee member of a conference and discovered that a reviewer had reviewed his own paper. I reported this issue to the conference organizers, and that person was kicked out of the program committee. Another example is journal reviewers who always ask in their reviews that we cite their papers, even when they are not relevant to our paper, just to increase their citation count. In my field, there is one reviewer who is especially known for doing this, as several researchers have talked to me about him. This is not good behavior, and I usually report it to the journal editor, but since reviewers work for free, there are typically no consequences for such people. A third example is that some researchers will often give preferential treatment to their friends. For example, I once attended a conference where three of the awards were handed to collaborators of the conference organizer. Although these papers may be good, it remains suspicious. Another example is from when I was applying for jobs in Canada, several years ago. At that time, I was one of the two remaining candidates for a professor position, but finally the other, much less experienced researcher was chosen, due to a likely conflict of interest.
  • Predatory journals and conferences. There are many journals of very low quality that publish only to earn money. These journals usually have a very broad scope, are published by unknown publishers and sometimes appear to not review papers. They also often send spam to promote their journals. This is a problem, and I obviously dislike such journals.
  • Unethical publications by some researchers. I have discovered and reported several journal papers that contained plagiarism. These papers have generally been retracted, as they should be. But in some cases, unethical behavior is not so easy to detect. For example, I have read some papers where I thought that the results were fake, but there was not enough evidence to prove it. It certainly happens that some researchers publish fake results, which is bad for academia.
  • Publishers that are sometimes too greedy. It is well known that some publishers charge very high fees to universities and individuals to publish and/or access research publications. This is somewhat unfortunate, because research is often funded by a government, done by researchers and reviewed for free by reviewers, while publishers are the ones earning money. It would be difficult to change this, as popular publishers are well established and there is pressure to keep this system. On the other hand, this publication system is not that bad. Actually, the good publishers filter out many bad papers and ensure a minimum quality level for papers, which is important.
  • Insufficient funding for research in some countries. Currently, I have a lot of funding, so I cannot complain about insufficient funding. But in some other countries, funding is quite rare and often insufficient for researchers in academia. This was the case when I was working in Canada. To apply for national funding from NSERC, we would have to write a budget requesting a large amount of money, but one was considered lucky to even get a fraction of it. Thus, not much money was available for students, for attending conferences, for publications, and for buying equipment. Besides, there are not enough professors at several universities in countries like Canada.
  • Reviewers that do not do their job well. As researchers, our work is evaluated by other researchers to determine if it should be published in a given conference proceedings or journal. Generally, reviewers do a good job and do it for free, which is very much appreciated. However, in some cases, reviewers don’t do their job correctly. For example, it once happened to me that a reviewer rejected my paper because he thought the problem could be solved in a simpler way. But the solution proposed by the reviewer in his review was wrong. Having said that, a reviewer often misunderstands a paper because it is not well written. Thus, such situations are often to be blamed on authors rather than reviewers. And often, when a paper is rejected, there are multiple problems in the paper.
  • Unprofessional behavior. In some cases, researchers display highly unprofessional behavior. This was for example the case for the ADMA 2015 conference, which was canceled without notifying authors, after papers had been submitted. The website just went offline, and the organizers simply ignored emails.
  • Bad paper presentations. I have attended many international conferences. Sometimes paper presentations are good, but sometimes they are not. There are several easily avoidable mistakes that a presenter should not make, such as turning one’s back to the audience, exceeding the time limit, and not being prepared.

This is all for today! I just wanted to share some things that I don’t like about academia. But actually, I really like academia. You can share your own perspective on academia in the comments below, or perhaps you may want to share solutions on how to improve academia. 😉

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

News about the data mining blog

This data mining blog was created more than five years ago and has had considerable success, with more than 800,000 views. For this, I want to thank all the readers. Today, I will announce some important news related to this blog.

Translation of the blog

The first piece of news is that the blog will be translated to make it more accessible in other languages. Since I work in China and there is a very large Chinese data mining community, I have recently added a Chinese translation of the data mining blog. It can be accessed by clicking the following link in the menu of this website.

chinese blog

In the Chinese version of the data mining blog, not all blog posts will be translated, only the most important ones. Currently, four posts have been translated. I have published two, and the others will be published in the following weeks.

chinese data mining

I am also considering adding a French translation, since I am a native French speaker. Other languages, such as Vietnamese and Spanish, could also be added if volunteers are willing to help me translate.

Video tutorials about data mining and big data

The second piece of news is that I am currently experimenting with software to record lectures and publish them online as HTML5 videos. In the near future, I will start publishing various videos about data mining. This will include some lectures that I have given, as well as some tutorials for my SPMF data mining software. I will also record some video tutorials presenting some classical data mining algorithms. Moreover, in a future blog post, I will discuss why recording videos can be useful to promote research.


In this blog post, I have given some news about future plans for the blog. Thanks again for reading and commenting. I am also looking for contributors. If you would like to contribute as a guest author or translator, just let me know.

Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Report about the DEXA 2018 and DAWAK 2018 conferences

This week, I am attending the DEXA 2018 (29th International Conference on Database and Expert Systems Applications) and the DAWAK 2018 (20th Intern. Conf. on Data Warehousing and Knowledge Discovery) conferences from the 3rd to 6th September in Regensburg, Germany.

dexa 2018 dawak 2018

These two conferences are well-established European conferences dedicated mainly to research on databases and data mining, and are always collocated. This is not the first time that I attend them: I previously attended DEXA 2016 and DAWAK 2016 in Portugal.

These conferences are not in the top 5 of their fields but are still quite interesting, usually featuring some good papers. The proceedings are published by Springer in the LNCS (Lecture Notes in Computer Science) series, which ensures that the papers are indexed by various academic databases.

Acceptance rates

For DEXA 2018, 160 papers were submitted; 35 were accepted as full papers (22%) and 40 as short papers (25%).

For DAWAK 2018, 76 papers were submitted; 13 were accepted as full papers (17%) and 16 as short papers (21%).


The conference is held at the University of Regensburg, in Regensburg, a relatively small town with a long history, about one hour from Munich. The town is a UNESCO World Heritage site. The university:

dexa 2018 location

A picture of the old town:

regensburg dexa

Why do I attend these conferences?

This year, my team and collaborators have four papers at these conferences, on topics related to high utility itemset mining, periodic pattern mining and privacy-preserving data mining:

  • Fournier-Viger, P., Zhang, Y., Lin, J. C.-W., Fujita, H., Koh, Y.-S. (2018). Mining Local High Utility Itemsets. Proc. 29th International Conference on Database and Expert Systems Applications (DEXA 2018), Springer, to appear.
  • Fournier-Viger, P., Li, Z., Lin, J. C.-W., Fujita, H., Kiran, U. (2018). Discovering Periodic Patterns Common to Multiple Sequences. 20th Intern. Conf. on Data Warehousing and Knowledge Discovery (DAWAK 2018), Springer, to appear.
  • Lin, J. C.-W., Zhang, Y. Y., Fournier-Viger, P., … (2018) A Heuristic Algorithm for Hiding Sensitive Itemsets. 29th International Conference on Database and Expert Systems Applications (DEXA 2018), Springer, to appear.
  • Lin, J. C.-W., Fournier-Viger, P, Liu, Q., Djenouri, Y., Zhang, J. (2018) Anonymization of Multiple and Personalized Sensitive Attributes. 20th Intern. Conf. on Data Warehousing and Knowledge Discovery (DAWAK 2018), Springer, to appear.

The first two papers are projects of my master's degree students, who will also attend the conference. Besides, I will also chair some sessions at both conferences.

Another reason for attending these conferences is that they are European conferences. Thus, I can meet some European researchers that I usually do not meet at conferences in Asia.

Day 1

I first registered. The process was quick. We received the proceedings of the conference on a USB drive, along with a conference bag.

dexa 2018 proceedings

I attended several talks from both the DEXA 2018 and DAWAK 2018 conferences on the first day. Here is a picture of a lecture room.

dexa 2018 lecture

There was also an interesting keynote talk about database modelling.

dexa keynote

In the evening, a reception was held at the old town hall.

Day 2

The second day featured several more presentations. In the morning, I chaired the session on classification and clustering. A new algorithm that enhances the K-Means clustering algorithm was proposed, which has the ability to handle noise. An interesting presentation by Frans Coenen proposed an approach where data is encrypted and then transmitted to a distant server offering data mining services such as clustering. Thanks to the encryption techniques, privacy can be ensured. In the morning, there was also a keynote about “smart aging”. I did not attend it though, because I instead had a good discussion with collaborators.

Day 3 – Keynote on spatial trajectory analysis

There was a keynote about “Spatial Trajectory Analytics: Past, Present and Future” by Xiaofang Zhou. It is a timely topic as nowadays we have a lot of trajectory data in various applications.

dexa trajectory data keynote

What is trajectory data? It is the traces of moving objects, where each object can be described using time, spatial positions and other attributes. An example of trajectory data is the movement of cars, which can be obtained from their GPS devices. Another example is the trajectories of mobile phones. Trajectory data is not easy to analyze because it only samples the movement of an object. Besides, trajectories are influenced by the environment (e.g. a road may be blocked). Other challenges are that the data may be inaccurate and that some data points may be redundant.

trajectory data
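To make these challenges concrete, here is a minimal sketch (with illustrative names; this is my own toy example, not code from the talk) of how a trajectory could be represented as timestamped points and how redundant points could be filtered out:

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class TrajectoryPoint:
    timestamp: float  # e.g. seconds since the start of the recording
    x: float          # spatial position (e.g. projected longitude)
    y: float          # spatial position (e.g. projected latitude)

def drop_redundant_points(points, min_distance=1.0):
    """Keep a point only if it is at least min_distance away from the
    last kept point: a very simple redundancy filter for a trajectory."""
    if not points:
        return []
    kept = [points[0]]
    for p in points[1:]:
        last = kept[-1]
        if hypot(p.x - last.x, p.y - last.y) >= min_distance:
            kept.append(p)
    return kept

# A car that barely moves between the first two GPS readings:
pts = [TrajectoryPoint(0, 0.0, 0.0),
       TrajectoryPoint(1, 0.1, 0.0),
       TrajectoryPoint(2, 2.0, 0.0)]
print(len(drop_redundant_points(pts)))  # 2 (the middle point is dropped)
```

A real system would of course also need to handle inaccurate readings (e.g. by smoothing), not just redundant ones.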

Trajectory data can be used in many useful ways such as route planning, point of interest recommendation, environment monitoring, urban planning, and resource tracking and scheduling. Trajectory data can also be combined with other types of data.

trajectory data applications

But how is trajectory data processed? Basically, we need to monitor the objects to collect the trajectories, store them in databases (which may provide various views, queries, privacy support, and indexing), and then analyze the data (e.g. using techniques such as clustering, sequential pattern mining or periodic pattern mining). Here is a proposed architecture of a trajectory analysis system:

trajectory data analysis
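As a rough illustration of this collect/store/analyze pipeline (a sketch with hypothetical table and function names, not the architecture presented in the talk), one could store trajectory points in a relational database and run simple analyses on top of it:

```python
import sqlite3

def store_trajectory(conn, object_id, points):
    """Collect step: store (timestamp, x, y) points for one moving object."""
    conn.executemany(
        "INSERT INTO trajectory (object_id, t, x, y) VALUES (?, ?, ?, ?)",
        [(object_id, t, x, y) for (t, x, y) in points],
    )

def average_speed(conn, object_id):
    """Analyze step: a very simple analysis, the average speed of an
    object (total distance traveled divided by total duration)."""
    rows = conn.execute(
        "SELECT t, x, y FROM trajectory WHERE object_id = ? ORDER BY t",
        (object_id,),
    ).fetchall()
    dist = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (t1, x1, y1), (t2, x2, y2) in zip(rows, rows[1:]))
    return dist / (rows[-1][0] - rows[0][0])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trajectory (object_id TEXT, t REAL, x REAL, y REAL)")
store_trajectory(conn, "car1", [(0, 0, 0), (10, 30, 40)])
print(average_speed(conn, "car1"))  # 5.0 (50 distance units over 10 seconds)
```

A production system would instead use a spatial database with proper indexing, and richer analyses such as the clustering and pattern mining techniques mentioned above.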

Here is a first book about spatial trajectory mining, written by the presenter in 2011:

trajectory data book

Here are some important topics in trajectory analysis:

trajectory analysis research topics

Then, the presenter discussed some specific applications of trajectory data analysis. Overall, it was an interesting introduction to the topic.

Day 3 – Banquet

In the evening, attendees were invited to a tour of a palace, and then to a banquet in a German restaurant.

dexa dawak banquet

Day 4

On the last day, there were more paper presentations and another keynote.

Next year

DAWAK 2019 and DEXA 2019 will be hosted in Linz, Austria from the 26th to the 29th August 2019.

Best paper award

The best paper award was given to the paper “Sequence-based Approaches to Course Recommender Systems” by Osmar Zaiane et al. It presents a system to recommend undergraduate courses to students. This system applies algorithms for sequential pattern mining and sequence prediction, among others, to select relevant courses.
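I have not seen the details of their models, but the general idea of sequence prediction for course recommendation can be sketched as follows (a toy first-order model with made-up course names, not the authors' actual method):

```python
from collections import Counter, defaultdict

def train_predictor(sequences):
    """Count, for each course, which course most often follows it in the
    students' enrollment histories (a first-order sequence model)."""
    follows = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            follows[a][b] += 1
    return follows

def recommend_next(follows, last_course):
    """Recommend the most frequent successor of the last taken course."""
    counts = follows.get(last_course)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical enrollment histories of three past students:
histories = [["CS101", "CS201", "CS301"],
             ["CS101", "CS201"],
             ["CS101", "MATH1"]]
model = train_predictor(histories)
print(recommend_next(model, "CS101"))  # CS201 (taken after CS101 by 2 of 3)
```

Real course recommenders such as the awarded one use richer sequential pattern mining and sequence prediction models, but the input (ordered course histories) has this shape.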


Overall, the quality of the papers was relatively high, and I was able to meet several researchers working on topics related to my research. It was thus a good conference to attend.


Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software.

Postdoctoral positions in data mining in Shenzhen, China (apply now)

The CIID research center of the Harbin Institute of Technology (Shenzhen campus, China) is looking to hire two postdoctoral researchers to carry out research on data mining / big data.

Harbin Institute of Technology (Shenzhen)

An applicant:

  • must have obtained a Ph.D. in Computer Science within the last 3 years,
  • must be less than 36 years old,
  • must have a strong research background in data mining, big data or artificial intelligence,
  • must have demonstrated the ability to publish papers in excellent conferences and/or journals in the field of data mining or artificial intelligence,
  • must have an interest in the development of data mining algorithms and their applications,
  • can come from any country (but if the applicant is Chinese, s/he should hold a Ph.D. from a 211 or 985 university, or from a university abroad).

The successful applicant will:

  • work on a data mining project that could be related to sequences, time series, spatial data, or some other topic related to data mining, with both a theoretical part and an applied part related to industrial design (the exact topic will be open for discussion to take advantage of the applicant’s strengths),
  • join an excellent research team, led by Prof. Philippe Fournier-Viger, the founder of the popular SPMF data mining library, and have the opportunity to collaborate with researchers from other fields,
  • have the opportunity to work in a laboratory equipped with state-of-the-art equipment (e.g. high-end workstations, a cluster of servers for big data research, GPU servers, virtual reality equipment, body sensors, and much more),
  • be hired for 2 years, at a salary of 171,600 RMB / year (51,600 RMB from the university + 120,000 RMB from the city of Shenzhen). Note that the postdoctoral researcher will pay no tax on this salary, and that an apartment can be rented at a very low price through the university (around 1,500 RMB / month, which saves a lot of money),
  • work in one of the top 50 universities in the field of computer science in the world, and one of the top 10 universities in China,
  • work in Shenzhen, one of the fastest-growing cities in the south of China, with low pollution, warm weather all year, and proximity to Hong Kong.

If you are interested in this position, please apply as soon as possible by sending your detailed CV (including a list of publications and references) and a cover letter to Prof. Philippe Fournier-Viger:  It is also possible to apply for the year 2019.