Report about the ICGEC 2018 conference

I recently attended the ICGEC 2018 conference (12th International Conference on Genetic and Evolutionary Computing) from December 14th to 17th, 2018 in Changzhou, China. In this blog post, I will describe the activities that I attended at the conference.

About the ICGEC conference

ICGEC is a good conference on the topics of evolutionary computing and genetic computing. This was the 12th edition of the conference. It is generally held in Asia and features some quality papers. The proceedings are published by Springer and indexed in EI, which ensures good visibility. Besides, the best papers are invited to various special issues of journals such as JIHMSP and DSPR. Also, there were six invited keynote speakers, which is more than what I usually see at international conferences. I attended this conference to give one of the keynote talks, on the topic of high utility pattern mining.

The conference was held partly at the Wanda Realm Hotel and the Changzhou College of Information Technology (CCIT).

Changzhou is middle-sized city not very far by train from Shanghai, Wuxi and Nanjing.  In terms of tourism, Changzhou is especially famous for some theme parks, and has also some museum and temples. The city has several universities and colleges.

View of Changzhou from my hotel window

Here is a picture of the conference materials (book, bag, gifts, etc.).

Opening ceremony

The opening ceremony was held by Dr. Yong Zhou and Prof. Jeng-Shyang Pan, honorary chairs of the conference. Also, Prof. Chun-Wei Lin, the general chair, briefly talked about the program. This year, about 200 submissions were received and around 36 papers were accepted.

Keynote talks

The first keynote was by Prof. Jhing-Fa Wang about orange technology and robots. The concept of orange technology is interesting. It refers to technologies that are designed to enhance the life of people in terms of (1) help, (2) happiness and (3) care. Just as we have the concept of “green technology” to refer to environment-friendly technology, “orange technology” was proposed so that we can focus on people. An example of orange technology is robots that can assist senior citizens.

The second talk was by Prof. Zhigeng Pan about virtual reality. Prof. Pan presented several applications of virtual reality and augmented reality.

The third talk was by Prof. Xiudeng Peng about industrial applications of artificial intelligence such as automatic inspection systems, fuzzy control systems, defect marking, etc. Prof. Peng reminded us that if we are interested in finding potential applications of AI, there are a lot of opportunities in industry. He also stressed the importance of developing machine learning models that can be updated in real time based on feedback, and that have online learning capabilities.

The fourth keynote talk was by Prof. Jiuyong Li about causal discovery and applications. The topic of causal discovery is very interesting, as it aims to find causal relationships in data rather than mere associations. Several models have been proposed in this field, for example to find causal rules and causal decision trees. Several software tools by Prof. Li are open source, and he has recently published a book on this topic.

The fifth keynote was by myself, Philippe Fournier-Viger. I presented an overview of our recent work on pattern mining, in particular itemset mining, high utility pattern mining, periodic pattern mining, significant pattern mining and local pattern mining. I also presented my open-source data mining software called SPMF. Finally, I discussed what I see as current research opportunities in the field of pattern mining, and how evolutionary and genetic algorithms can be used in this field (since this is the main topic of the conference).

Then, there was a last keynote talk by Dr. Peter Peng about genetic algorithms, clustering and industry applications.

Regular talks

On the second day, there were several regular paper presentations grouped by topics, including machine learning, evolutionary computing, image and video processing, information hiding, smart living, classification and clustering, applications of genetic algorithms, smart internet of things, and artificial intelligence.

Social activities

On the first day, a special reception was held for invited guests and committee members at the hotel. A buffet was held at the hotel on the evening of the second day, and a banquet on the evening of the last day of the conference. Overall, there were many opportunities to discuss with other researchers, and people were very friendly.

Conclusion

The ICGEC 2018 conference was well-organized, and it was a pleasure to attend. Next year, ICGEC 2019 will be held in Qingdao, China, which is a nice city close to the sea. It will be organized by professors from the Shandong University of Science and Technology.

Introduction to frequent subranking mining

Rankings are made in many fields, as we naturally tend to rank objects, persons or things in different contexts. For example, in a singing or sport competition, judges will rank participants from worst to best and give prizes to the best participants. Another example is persons who rank movies or songs according to their tastes on a website by giving them scores.

If one has ranking data from several persons, it is possible that the rankings appear quite different. However, using appropriate techniques, it is possible to extract information that is common to several rankings and that can help to understand them. For example, although a group of people may disagree on how they rank movies, many people may agree that Jurassic Park 1 was much better than Jurassic Park 2. The task of finding subrankings common to several rankings is called frequent subranking mining. In this blog post, I will give an introduction to this problem and to some techniques that can be used to mine frequent subrankings.

The problem of subranking mining

The input of the frequent subranking mining problem is a set of rankings. For example, consider the following database containing four rankings named r1, r2, r3 and r4, where some food items are ranked by four persons.

Ranking ID  Ranking
r1  Milk < Kiwi < Bread < Juice
r2  Kiwi < Milk < Juice < Bread
r3  Milk < Bread < Kiwi < Juice
r4  Kiwi < Bread < Juice < Milk

The first ranking r1 indicates that the first person prefers juice to bread, prefers bread to kiwi, and prefers kiwi to milk. The other lines follow the same format.

To discover frequent subrankings, the user must specify a value for a parameter called the minimum support (minsup). Then, the output is the set of all frequent subrankings, that is, all subrankings that appear in at least minsup rankings of the input database. Let me explain this with an example. Consider the ranking database of the above table and minsup = 3. Then, the subranking Milk < Juice is said to be a frequent subranking because it appears in at least 3 rankings of the database. In fact, it appears in exactly three rankings, as shown below:

Ranking ID  Ranking
r1  Milk < Kiwi < Bread < Juice   (contains Milk < Juice)
r2  Kiwi < Milk < Juice < Bread   (contains Milk < Juice)
r3  Milk < Bread < Kiwi < Juice   (contains Milk < Juice)
r4  Kiwi < Bread < Juice < Milk

The number of rankings in which a subranking occurs is called its support (or occurrence frequency). Thus, the support of the subranking Milk < Juice is 3. Another example is the subranking Kiwi < Bread < Juice, which has a support of 2, since it appears in two rankings of the input database:

Ranking ID  Ranking
r1  Milk < Kiwi < Bread < Juice   (contains Kiwi < Bread < Juice)
r2  Kiwi < Milk < Juice < Bread
r3  Milk < Bread < Kiwi < Juice
r4  Kiwi < Bread < Juice < Milk   (contains Kiwi < Bread < Juice)

Because the support of Kiwi < Bread < Juice is less than the minimum support threshold, it is NOT a frequent subranking.

To give a full example, if we set minsup = 3, the full set of frequent subrankings is:

Milk < Bread    support: 3
Milk < Juice    support: 3
Kiwi < Bread    support: 3
Kiwi < Juice    support: 4
Bread < Juice   support: 3

In this example, all frequent subrankings contain only two items. But if we set minsup = 2, we can find some subrankings containing more than two items, such as Kiwi < Bread < Juice, which has a support of 2.
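
As an aside, checking whether a subranking appears in a ranking is easy to implement. Below is a small Java sketch that I wrote for illustration (it is not code from the paper): since each item appears exactly once in a ranking, a single greedy scan can verify that the items of a subranking occur in the same relative order, and the support is obtained by counting the rankings that pass this check.

import java.util.*;

public class SubrankingSupport {

    // Returns true if the items of 'sub' appear in 'ranking' in the same relative order
    static boolean containsSubranking(List<String> ranking, List<String> sub) {
        int pos = 0;
        for (String item : ranking) {
            if (pos < sub.size() && item.equals(sub.get(pos))) {
                pos++;
            }
        }
        return pos == sub.size();
    }

    // The support of a subranking is the number of rankings that contain it
    static int support(List<List<String>> database, List<String> sub) {
        int count = 0;
        for (List<String> ranking : database) {
            if (containsSubranking(ranking, sub)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> db = List.of(
            List.of("milk", "kiwi", "bread", "juice"),   // r1
            List.of("kiwi", "milk", "juice", "bread"),   // r2
            List.of("milk", "bread", "kiwi", "juice"),   // r3
            List.of("kiwi", "bread", "juice", "milk"));  // r4
        System.out.println(support(db, List.of("milk", "juice")));          // prints 3
        System.out.println(support(db, List.of("kiwi", "bread", "juice"))); // prints 2
    }
}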

This is the basic idea about the problem of frequent subranking mining, which was proposed in this paper:

Henzgen, S., & Hüllermeier, E. (2014). Mining Rank Data. International Conference on Discovery Science.

Note that the paper also proposes to then use the frequent subrankings to generate association rules.

How to discover the frequent subrankings?

In their paper, Henzgen & Hüllermeier proposed an Apriori-based algorithm to mine frequent subrankings. However, it can be observed that the problem of subranking mining can already be solved using existing sequential pattern mining algorithms such as GSP (1996), PrefixSpan (2001), CM-SPADE (2014), and CM-SPAM (2014). This was explained in an extended version of the “Mining rank data” paper published on arXiv (2018), where other algorithms specially designed for subranking mining were also proposed.

Thus, one can simply apply sequential pattern mining algorithms to solve the problem. I will show how to use the SPMF software for this purpose. First, we need to encode the ranking database as a sequence database. I have thus created a text file called kiwi.txt as follows:

@CONVERTED_FROM_TEXT
@ITEM=1=milk
@ITEM=2=kiwi
@ITEM=3=bread
@ITEM=4=juice
1 -1 2 -1 3 -1 4 -1 -2
2 -1 1 -1 4 -1 3 -1 -2
1 -1 3 -1 2 -1 4 -1 -2
2 -1 3 -1 4 -1 1 -1 -2

In that format, each line is a ranking. The value -1 is a separator and -2 indicates the end of a ranking.  Then, if we apply the CM-SPAM implementation of SPMF with minsup = 3, we obtain the following result:

milk -1 #SUP: 4
kiwi -1 #SUP: 4
bread -1 #SUP: 4
juice -1 #SUP: 4
milk -1 bread -1 #SUP: 3
milk -1 juice -1 #SUP: 3
kiwi -1 bread -1 #SUP: 3
kiwi -1 juice -1 #SUP: 4
bread -1 juice -1 #SUP: 3

which is what we expected, except that CM-SPAM also outputs single items (the first four lines above). If we don't want to see the single items, we can apply CM-SPAM with the constraint that patterns must contain at least two items. Then, we get exactly the set of all frequent subrankings:

milk -1 bread -1 #SUP: 3
milk -1 juice -1 #SUP: 3
kiwi -1 bread -1 #SUP: 3
kiwi -1 juice -1 #SUP: 4
bread -1 juice -1 #SUP: 3

We can also apply other constraints on subrankings using CM-SPAM, such as a maximum number of items. If you want to try it, you can download SPMF and follow the instructions on the download page to install it. Then, you can create the kiwi.txt file and run CM-SPAM as follows:
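
For those who prefer a terminal to the graphical interface, SPMF can also be launched from the command line. The invocation should look roughly like the sketch below; please check the CM-SPAM page of the SPMF documentation for the exact list and order of the optional parameters (such as the minimum number of items per pattern):

java -jar spmf.jar run CM-SPAM kiwi.txt output.txt 70%

Here, kiwi.txt is the input file, output.txt is where the frequent subrankings will be written, and 70% is the minimum support.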

You will notice that the minsup parameter is set to 70% (0.7) instead of 3. The reason is that in the SPMF implementation, the minimum support is expressed as a percentage of the number of rankings (sequences) in the database. Thus, since we have 4 rankings, we multiply 4 by 0.7 to obtain 2.8, which is rounded up so that minsup is equal to 3 rankings (a subranking is considered frequent if it appears in at least 3 rankings of the database).

What is also interesting is that we can apply other sequential pattern mining algorithms of SPMF to find different types of subrankings: VMSP to find the maximal frequent subrankings, CM-ClaSP to find the closed subrankings, or even VGEN to find the generator subrankings. We could also apply sequential rule mining algorithms such as RuleGrowth and CMRules to find rules between subrankings.

Conclusion

In this blog post, I discussed the basic problem of mining subrankings from rank data, and showed that it can be solved as a special case of sequential pattern mining.

The DSPR journal is growing… and we are looking for your papers!

This week, I want to talk to you about a journal of which I am one of the editors-in-chief, called Data Science and Pattern Recognition (DSPR). The journal was established in 2017, and currently 18 papers have been published, are in press, submitted or in preparation. Although the journal is not published by a very famous publisher, it is an up-and-coming journal that has been quite successful so far and is growing quickly. In particular, the journal has a strong foundation. The editorial board includes several famous researchers from the field of data mining who have given their support to the journal. Moreover, for the first 13 published papers, the average is about 7 citations per paper, which is high. The journal also has some highly cited papers; one of them has more than 70 citations.

What is in the future of the DSPR journal in 2019?

We are looking to reach the milestone of 40 published papers this year so that we can submit the journal to EI and it becomes an EI-indexed journal. This is a very important step to make the journal grow even faster.

Looking forward to receiving your papers 😉

Lastly, I would like to thank everyone who is supporting the journal, and also mention that we are now looking for papers of various lengths and on various topics related to data science/data mining and pattern recognition. These can be research papers or survey papers. If you are interested in writing a paper for the journal, you may contact us for more information or visit the website. Also, if you have a paper that is not yet submitted but is in the works, please consider submitting it to DSPR!


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 150 data mining algorithms.

Is the next artificial intelligence winter coming?

In recent years, there has been increased interest in Artificial Intelligence (AI). This is due in part to advances in training and building neural networks, which have made it possible to solve some difficult problems with greater success. This has led to some very large investments in AI from various governments and companies, and an increased interest in AI in academia and in the mainstream media. This provides a lot of funding and opportunities for AI researchers and companies. However, when there is a lot of hype and expectation, there is a real risk that expectations will not be met and that the bubble will burst. For AI, this has already happened twice, in the 70s and 80s, after expectations were not met. At those times, AI funding greatly decreased for several years. These periods are called AI winters. Will there be another AI winter soon? I will discuss this in this blog post.

Recent breakthroughs in AI

During previous AI winters, many people were disappointed by AI due to its inability to solve complex problems. One of the reasons was that the computing power available at the time was not enough to train complex models. Since then, computers have become much more powerful. Recently, some of the greatest breakthroughs in AI have been made possible by the increase in computing power and in the amount of data. For example, deep learning has emerged as a key family of machine learning models; it is basically neural networks with more hidden layers, trained on GPUs to obtain more computing power. Such models perform some tasks very well, in particular tasks related to image classification, labelling, speech processing and language translation. For example, the ImageNet computer vision task was solved with very high accuracy by the AlexNet model a few years ago, using GPUs to train neural networks. Then, various other improvements were made, such as generating content using adversarial networks and using reinforcement learning for game playing (e.g. AlphaGo).

However, it can be argued that these models do some tasks better than previous models but do not actually do something really new. For example, although increasing the accuracy of document translation or image classification is useful, we are still very far from having models that do something much more complicated, such as writing a text that makes sense or having a real conversation with humans (not just a scripted chatbot!). It also seems clear that just increasing the computing power with more GPUs will not be enough to achieve much more complicated tasks. To achieve “general artificial intelligence”, some key aspects such as common sense reasoning must be considered, which are lacking in current models. Thus, current deep learning models can only be seen as a small step toward a truly intelligent machine, and more research will be needed.

In fact, it can be observed that the biggest recent breakthroughs are limited to some specific areas such as image and speech processing. For example, this year I visited the International Big Data Expo 2018 in China, and there were so many companies displaying computer vision products based on deep learning that, after a while, one may wonder what other problems deep learning can solve.

Huge expectations towards AI  and the need for a return on investment

There is no doubt that AI is very useful. But the huge expectations that some investors currently have towards AI are dangerous, as it seems that some of them will not be met in the short term. And this could lead to disappointment, and to a decrease in investments (a winter).

For example, currently one of the most popular applications of AI discussed in the media is self-driving cars. Huge sums of money have been invested in this technology by multiple companies. However, when we see the recent car crashes and deaths caused by prototype self-driving cars in the US, it is clear that the technology is not 100% safe. I think that the only way safety could be achieved for such cars is if only self-driving cars were on the road, but this will not happen anytime soon. And who would like to be in a car that is not safe? Thus, such research has the potential to lead to a huge disappointment in the short term, as investors may not see a return on investment. Another example is the research by giants such as Amazon on drone delivery. It is certainly an interesting idea, but in practice such technology will be met with many practical problems (what if a drone crashes and kills someone? What if people start shooting at these drones or blinding them with lasers? How much weight can these drones carry? And would it even make sense economically?).

There is also a lot of hype in the media promising that AI could replace many jobs in the near future, including those of radiologists. Moreover, some researchers have even started to discuss the dangers of AI in the media, which seems very far-fetched, as we are nowhere close to some general artificial intelligence. But all this discussion increases the expectations of the general public towards AI. To take advantage of the hype around AI, more and more consumer products are said to be “powered by AI”, such as the cameras of cellphones. Even the Bing search engine has been updated with a chatbot, which actually does not appear to be much “smarter” than the chatbots of the 1990s (see picture below).

The 2018 Bing chatbot (screenshot of a confused answer)

For companies, what will determine whether there is a next AI winter is whether they can see a clear return on investment when hiring expensive AI specialists to develop AI products. If the return on investment is not there, then the funding will disappear and projects will be terminated.

At the ADMA 2018 conference, I recently talked with a top researcher who has many relationships with industry, and he told me that many companies currently don't see a return on investment for their AI projects. That researcher predicted that an AI winter could occur as early as within the next 6 months. But we never know. This is very hard to predict. It is a bit like trying to predict the stock market!

Personally, I could see an AI winter happening in the next few years. But I think that it could perhaps be a soft winter, where interest decreases but some interest always remains, as AI is useful. What is your prediction? Please leave your comments below!


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Report about the 13th ADMA conference (ADMA 2018)

I recently attended the 13th International Conference on Advanced Data Mining and Applications (ADMA 2018) in Nanjing, China, from the 16th to the 18th of October 2018. In this blog post, I will give a brief report about this conference.

What is the ADMA conference?

ADMA is a conference on data mining, which is generally held in China, and sometimes in other parts of Asia. It is overall a decent conference. In particular, the proceedings are published by Springer in the Lecture Notes in Artificial Intelligence series, which ensures good visibility for the accepted papers, and all papers are indexed in EI and DBLP. One particularity of this conference is its focus on applications of data mining.

The ADMA conference started in 2005 and was held every year until 2014. I attended ADMA 2011, ADMA 2013 and ADMA 2014, and also had a paper in ADMA 2012. I had submitted a paper to ADMA 2015, but that edition was cancelled. Since ADMA 2016, the conference has been held every year again, with quality papers, and this year I was glad to be back at ADMA.

Location

The conference was held at the Marriott hotel in Nanjing. Nanjing is the capital of the Jiangsu province in China. Nanjing has a long history and has been the capital of several Chinese dynasties. There are many things to see, and it is close to some other popular cities like Suzhou.

Schedule

The main conference was held over two days, while a third day was used for doctoral student forums. For the main conference, there were two keynote speakers in the morning of each day. Then, in the afternoon, there were paper presentations. Due to the tight schedule, all papers were presented in either 10 or 15 minutes (including questions). In the evenings, there was a reception on the first day and a banquet on the second day.

Acceptance rate

It was announced that 104 research papers were submitted this year, from 20 countries and 5 continents. A total of 46 papers were accepted. From these papers, 24 were selected for long presentations and 22 for short presentations. Both types of papers had the same number of pages in the proceedings. Thus, the overall acceptance rate is 44.2%. Here is a slide about the review process, from the opening ceremony:

(Slide: the ADMA 2018 review process)

Registration

On the first day, there was the conference registration. We received the conference program, a badge, a pen, a notebook and a laser pointer as gifts. The conference proceedings were on a USB drive embedded in the laser pointer.

Welcome speech

Then, there was a brief introduction by a high-ranking representative (dean?) of the Nanjing University of Aeronautics and Astronautics, which organized the conference. After that, the local organizers gave some information about the conference.

And we took a group picture.

Day 1 – Keynote by Xuemin Lin on Graph Data Mining

The first keynote was given by Xuemin Lin, the editor-in-chief of TKDE, one of the top data mining journals. The talk was about graph analysis. In the first part of the talk, some applications of graph analysis were introduced, such as:

  • detection of fraud in a social graph (people collaborating to commit fraud), which can be found, for example, by mining rings or bi-cohesive subgraphs,
  • product recommendation, where customers, preferences, purchased products and locations are put in a graph model (a multidimensional graph),
  • planning the delivery of food and products to homes in an efficient way.

Then, some key challenges for graph analysis were presented: defining new computing platforms, analytic models, processing algorithms, indexing techniques, and processing systems (primitive operators, query languages, distributed techniques, storage, etc.) for graphs. In other words, we need to define new models and software specialized for analyzing graphs.

Finally, several problems related to graph analysis were briefly discussed.

Overall, this was a good keynote talk, as it gave an up-to-date overview of several graph analysis problems.

Day 1 – Keynote by Ekram Hossain on Deep Learning for Resource Allocation in Wireless Networks

The talk was about using stacked auto-encoders for resource allocation in wireless networks. The main resources are channels, the transmission power of radio stations (power allocation) and antennas (shared by many users; how to allocate them to many users). The speaker was a specialist from the field of communications.

In theory, this should have been a very interesting talk. But a problem with this talk was that the speaker spent most of the time explaining basic concepts of machine learning, and ran out of time before talking about how he was actually using deep learning for resource allocation (which was supposed to be the key part of the talk).

Day 1 – paper presentations

There were several paper presentations about various topics related to data mining, such as clustering, outlier detection and pattern mining. I also presented a paper about my student's project, which is to discover change points of high utility patterns in a temporal database of customer transactions:

Fournier-Viger, P., Zhang, Y., Lin, J. C.-W., Koh, Y.-S. (2018). Discovering High Utility Change Points in Transactional Data. Proc. 13th Intern. Conference on Advanced Data Mining and Applications (ADMA 2018), Springer LNAI, 10 pages.

Day 1 – Reception

Then, in the evening, a buffet dinner was offered at the Marriott hotel, which was a good opportunity to discuss with other researchers.

Day 2

On the second day, there were more keynotes and paper presentations.

Day 2 – banquet

Then, there was a banquet at the hotel of the conference.

Next year: the ADMA 2019 conference

It was announced that the ADMA 2019 conference will be held in Dalian, China, from the 21st to the 23rd of November 2019. The planned dates for ADMA 2019 are as follows:

  • Paper submission: 10th May 2019
  • Demo: 10th June 2019
  • Tutorial: 1st August 2019
  • Competition: 15th August 2019
  • Research student forum: 10th September 2019

Conclusion

It was an interesting conference. Although it is not a very big conference, there were some good keynote speakers, and I had some very good discussions with other researchers. Looking forward to the 14th ADMA conference (ADMA 2019) next year.

==
Philippe Fournier-Viger is a professor and also the founder of the SPMF open-source data mining software, which offers more than 130 data mining algorithms.

How to calculate Erdös numbers, and co-authorship relationships in academia

There exist many ways of analyzing the relationships between co-authors in academia. In this blog post, I will talk about a fun measure called the “Erdös number”, which was proposed in the field of mathematics in the 90s. The Erdös number of a person is the distance to Paul Erdös when considering co-authorship links on academic publications. For example, if you have written a paper with Paul Erdös, you have an Erdös number of 1. If you have written a paper with a co-author of Erdös, then your Erdös number is 2. And so on.

The concept of the Erdös number is based on the concept of the “degree of separation” between people in a social network: the idea that no one is ever very far from any other person through their social links. Maybe you wonder why Erdös is used as the reference for this measure. The reason is that Paul Erdös is one of the most prolific authors in mathematics, with more than 1000 papers. Thus, Paul Erdös is widely connected to other researchers. But other people can also be used to compute such numbers.
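
In graph terms, an Erdös number is just a shortest-path distance: authors are the nodes of a co-authorship graph, an edge connects two authors who have written a paper together, and since all edges count equally, a breadth-first search from Erdös gives everyone's number. Here is a minimal Java sketch of this idea on a toy graph (the graph and the author names are made up for illustration):

import java.util.*;

public class ErdosNumber {

    // Breadth-first search over co-authorship links: returns the distance
    // (e.g., the Erdös number) of every author reachable from 'source'.
    static Map<String, Integer> distances(Map<String, List<String>> coauthors, String source) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String author = queue.poll();
            for (String next : coauthors.getOrDefault(author, List.of())) {
                if (!dist.containsKey(next)) {   // first visit = shortest distance
                    dist.put(next, dist.get(author) + 1);
                    queue.add(next);
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // A toy co-authorship graph (an edge means "wrote a paper together")
        Map<String, List<String>> graph = Map.of(
            "Erdos", List.of("Alice", "Bob"),
            "Alice", List.of("Erdos", "Carol"),
            "Bob", List.of("Erdos"),
            "Carol", List.of("Alice"));
        // Alice and Bob get number 1, and Carol gets number 2
        System.out.println(distances(graph, "Erdos"));
    }
}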

What is your Erdös number?

If you want to compute your distance to Erdös or to any other researcher in fields related to mathematics or computer science, a good way is to use the MathSciNet website. It lets you compute your collaboration distance to any other person. It may not consider all publications, but it should give quite accurate results. For example, I have used it to compute my distance to Paul Erdös, Albert Einstein and Alan Turing. The results are below:

(Screenshot of the MathSciNet collaboration distance results)

Thus, according to this tool, my Erdös, Einstein and Turing numbers are N = 4, 6, and 7, respectively. If you have collaborated with me, an upper bound on your numbers is thus N+1. All of this does not mean much, as our contributions to science are much smaller than those of these geniuses. But it shows that researchers are often not far apart.

Conclusion

This was just a short blog post to show you this interesting tool for calculating the distance between researchers in academia. It is not a new concept, but I think it is interesting, and it shows that people are indeed never very far apart in academia. What is your Erdös number? You can share it in the comment section below!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

(video) Periodic Frequent Itemset Mining with PFPM

This is a video presentation of the paper “PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures”, about periodic pattern mining using PFPM. It is part of my new series of videos about data mining algorithms.

(link to download the video if the player does not work)

More information about the PFPM algorithm is provided in this research paper:

Fournier-Viger, P., Lin, C.-W., Duong, Q.-H., Dam, T.-L., Sevcic, L., Uhrin, D., Voznak, M. (2016). PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures. Proc. 2nd Czech-China Scientific Conference 2016, Elsevier, 10 pages.

The source code and datasets of the PFPM algorithm can be downloaded here:

The source code of PFPM and datasets are available in the SPMF software.

I will post more videos like this in the near future. If you have any comments, please post them in the comment section!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

How does journal paper similarity checking work? (CrossCheck)

In this blog post, I will talk about the recent trend of journal editors rejecting papers because of their similarity with other papers, as detected by the CrossCheck system. I will explain how this system works and talk about its impact, benefits and drawbacks.

What is similarity checking?

Nowadays, when an author submits a paper to a well-known academic journal (from publishers such as Springer, Elsevier and IEEE), the editor will first submit the paper to an online system to check if the paper contains plagiarism. That system compares the paper with papers from a database created by various publishers and websites to check if the paper is similar to some existing documents. Then, a report is provided to the journal editor indicating if there is some similarity with existing documents. In the case where the similarity is high or some key parts have clearly been plagiarized from other authors, the editor will typically reject the paper. Otherwise, the editor will send the paper to reviewers and start the normal review process.

Why check the similarity with other papers?

There are two reasons why editors perform this similarity check:

  • to quickly detect plagiarized papers that should clearly not be published.
  • to check if a paper from an author is original (i.e. if it is not too similar to previous papers from the same author).

In the second case, some journal editors will say, for example, that the “similarity score” should be below 20% or 40%, depending on the journal. Thus, under this model, an author is allowed to reuse just a little bit of text from his own papers.

How does it work?

Now you perhaps wonder how that similarity score is calculated. Having access to some similarity reports generated by the CrossCheck system, I will describe what these reports look like and then explain some key aspects of this system.

After the editor submits a paper to CrossCheck, he receives a report. This report contains a summary page that looks like this:

Part of a CrossCheck similarity report

This report gives an overall similarity score of 32%. It can be interpreted as meaning that, overall, 32% of the content of the text matches existing documents. It is furthermore said that 4% is a match with internet sources, 31% with some other publications and 2% with student papers. And as can be observed, 31% + 2% + 4% does not add up to 32%. Why? Actually, the calculation of the similarity score is misleading. Although I do not have access to the formula or the source code of the system, I found some explanation online: the similarity score is computed by matching each part of the text with at most one document. In other words, if some paragraph of a submitted paper matches two existing documents, this paragraph will be counted only once in the overall score of 32%.
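
To make this concrete, here is a small Java sketch of how such a score could be computed (this is my own guess at the behavior based on the explanation above, not CrossCheck's actual code). Each match is modelled as a character interval of the submitted paper; overlapping intervals are merged so that a passage matching several sources is counted only once:

import java.util.*;

public class SimilarityScore {

    // A match: a character interval [start, end) of the submitted paper
    // that was found in some source document.
    record Match(int start, int end) {}

    // Overall similarity: the fraction of the paper's characters covered by
    // at least one match, with overlapping matches merged and counted once.
    static double overallSimilarity(List<Match> matches, int paperLength) {
        List<Match> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt(Match::start));
        int covered = 0, curStart = 0, curEnd = 0;   // current merged run
        for (Match m : sorted) {
            if (m.start() > curEnd) {                // disjoint: close the current run
                covered += curEnd - curStart;
                curStart = m.start();
                curEnd = m.end();
            } else {                                 // overlapping: extend the current run
                curEnd = Math.max(curEnd, m.end());
            }
        }
        covered += curEnd - curStart;
        return (double) covered / paperLength;
    }

    public static void main(String[] args) {
        // Two sources match the same passage [0, 100) and one matches [150, 200):
        // the shared passage is counted once, so the score is 15%, not 25%.
        List<Match> matches = List.of(new Match(0, 100), new Match(0, 100), new Match(150, 200));
        System.out.println(overallSimilarity(matches, 1000));   // prints 0.15
    }
}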

An annotated PDF is also provided to the editor, highlighting the parts that match existing documents. For example, I show a page of such a report below, where I have blurred the text for anonymization:

Detailed similarity comparison (blurred)

In such a report, matching parts are highlighted in different colors, and numbers indicate which document has been matched to which part of the text.

Limitations of this similarity checking system

I will now describe some problems that I have observed in the reports made by this similarity checking software:

  • In the above report, the countries of the authors and their affiliations are considered as matching with their previous documents, which increases the similarity score. But obviously, this should not be taken into account. Can we blame an author for using the same affiliation in two papers?
  • Keywords are also considered as matching with previous documents. But I don't think that using some of the same keywords as another paper should be an issue.
  • Some of the matches are some very generic expressions or sentences used in many papers such as “explained in the next section” or “this paper is organized as follows”.
  • Another limitation is that this similarity check completely ignores figures and illustrations. Thus, if an author extends a conference paper into a journal paper and adds many figures for experiments to further differentiate the two papers, these figures will be completely ignored when calculating the similarity score.
  • Actually, the similarity checking system is limited to the text content of the paper. It can check the main text and the text in tables, algorithms, math formulas, the biography and affiliations. But it cannot check the text in figures that are included as bitmaps (pictures) in a paper. For example, if one includes an algorithm in a paper as a bitmap instead of as text, then the system will ignore that content. The system will only be able to compare the labels of the figures and not their content. Thus, an author with malicious intent could easily hide content from the matching system by transforming some content of an article into a bitmap.
  • In the report that I have analyzed, I found that the bibliography is also considered when computing the similarity score. This seems quite unfair. Citing the same references as some other papers (especially when they are from the same author) is not plagiarism. In the report that I read, about 90% of the references were considered as matching those of several other documents, which probably increased the similarity score by at least 10%. But I have noticed that the editor can deactivate this function.
  • I have also observed that the system can match the biography of the authors at the end of the paper and the acknowledgements with those of their previous papers. This is also a problem. It is clearly not plagiarism to reuse the same biography or acknowledgement in two papers. But in this system, it increases the similarity score.

Thus, my opinion is that this system is quite imperfect. And in fact, it is not claimed to be a perfect system.

What is the impact of this system?

The major impact is that many plagiarized papers can be detected early, which is a good thing, as detecting these papers can save editors and reviewers a lot of time.

However, a drawback of this system is that the metrics are clearly imperfect, and there is a real danger that some editors just check the similarity score to make a decision on a paper without reading the report carefully. For example, I have heard that some journals simply apply some arbitrary threshold, such as rejecting all papers with a score >= 30%. This is, in my opinion, a problem if the threshold is too low, because in some cases it is justified for an author to reuse text from his own previous papers. For example, an author may want to reuse some basic problem definitions from his own paper in a second paper with different contributions. Or an author may want to extend a conference paper into a journal paper with some new contributions. In such cases, I think that accepting some overlap between papers is reasonable.

A few years ago, when such systems were not in use, it was quite common for authors to extend a conference paper into a journal paper by adding 50% more content. Today, with this system, this may not be allowed anymore, maybe forcing authors to avoid publishing early results in conference papers (or otherwise to spend extra time rewriting their paper in a different way to extend it into a journal paper).

Another aspect is that such a system needs to create a database of all papers. But should the authors have to agree for their papers to be put in this database? Probably not, because when a paper is published, the authors typically have to transfer the copyright to the publisher. Thus, I guess that the publisher is free to share the paper with such a similarity checking service. But still, it raises some questions. If we make a comparison, there exists a homework plagiarism checking system called TurnItIn. This system has actually been legally challenged in the US and Canada, where some students have won court battles so that their homework is not submitted to/included in the system. Although it is a slightly different situation, we could imagine that some people may also want to challenge journal similarity checking systems.

How to get a similarity checking report for your paper?

Checking the similarity of a paper is not free. However, editors or associate editors of journals have a subscription to use the similarity checking service. Thus, if you know an editor or associate editor who has a subscription, he may perhaps be able to generate a report for your paper for free. Otherwise, one can pay to obtain the service.

Conclusion

In this blog post, I provided an overview of the similarity checking system called CrossCheck, used by several publishers and journals. I also talked about how the scores appear to be computed, some limitations of this system, and its impact on the academic world. I hope this has been interesting. Please share your comments in the comment section below.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Skills needed for a data scientist? (comments on the HBR article)

Recently, I read an article on the Harvard Business Review (HBR) website about data science skills for businesses. This article proposes to categorize skills related to data on a 2×2 matrix, where skills are labelled as useful vs not useful, and time-consuming vs not time-consuming. The author of that article has drawn such a 2×2 matrix illustrating the needs of his team (see below).

Obtained from Harvard Business Review

This matrix has received many negative comments online in the last few days. These comments have mainly highlighted two problems:

  • Why are mathematics and statistics viewed as useless?
  • Data science is viewed as useful, but mathematics and statistics are viewed as useless, which is strange, since math and stats are part of data science.

Having said that, I also don't like this chart. And many people have asked why it was published in the Harvard Business Review (a good magazine). But we should keep in mind that this chart illustrates the needs of one company. Thus, it does not claim that mathematics and statistics are useless for everyone. It is quite possible that this company does not see any benefit in mathematics and statistics courses or training. Following the negative comments, the author and the editor of HBR reworded some parts of the article to make it clearer that it should be interpreted as a case study.

Part of the problem related to this chart and article is that the term “data science” has always been very ambiguous. Some people with very different backgrounds doing very different things call themselves data scientists. This is a reason why I usually don't use this term. And it could be part of the reason why this chart shows a distinction between data science, math and stats, which I would describe as overlapping.

From a more abstract perspective, this article highlights that some companies are not interested in investing in skills that take too much time to acquire (and have no short-term benefits). For example, I know that some companies prefer to use code from open-source projects or ready-made tools to analyze data rather than spending time developing customized tools to solve problems. This is understandable, as the goal of companies is to earn money and there are many tools available for data analysis. However, one should not forget that using these tools often requires an appropriate background in mathematics, statistics or computer science to choose an appropriate model, given its assumptions, and to correctly interpret the results. Thus, having those skills that take more time to acquire is also important.

What is your opinion about this chart and the most important skills for a data scientist? Please share your opinion in the comment section below.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

(video) Minimal High Utility Itemset Mining with MinFHM

This is a video presentation of the paper “Mining Minimal High Utility Itemsets” about high utility itemset mining using MinFHM. It is the first video of a series of videos that will explain various data mining algorithms.

(link to download the video if the player does not work)

More information about the MinFHM algorithm is provided in this research paper:

Fournier-Viger, P., Lin, C.W., Wu, C.-W., Tseng, V. S., Faghihi, U. (2016). Mining Minimal High-Utility Itemsets. Proc. 27th International Conference on Database and Expert Systems Applications (DEXA 2016). Springer, LNCS, 13 pages, to appear

The source code and datasets of the MinFHM algorithm can be downloaded here:

The source code of MinFHM and datasets are available in the SPMF software.

I will post videos like this perhaps once every week or every few weeks. I actually have a lot of PPTs explaining various algorithms on my computer, but I just need to find time to record the videos. In a future blog post, I will also explain which software and equipment can be used to record such videos. This is the first video, so obviously it is not perfect. I will make some improvements in the following videos. If you have any comments, please post them in the comment section!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.