Do not link to impact factors, they will censor you!

On July 20 2017, I received an e-mail from a company called Clarivate Analytics Trademark Enforcement ( ) about copyright infringement for the Journal Citation Reports, a product by Thomson Reuters.

They contact with me because a few years ago I have created a webpage that provides a list of useful links for researchers such as how to check if a journal is indexed by SCI, SCIE, IS, etc.  It is a very simple webpage that do not host any copyright data, and is very useful for researchers. You can see a screenshot below:

A webpage with links

Copryight claim

So what is the problem with this webpage? Well, the company did not like that I linked to two websites called SCIJ***  and bioxb…com  that provide the impact factors of journals, which is a very useful information for researchers.  They asked me to remove these two links as these websites apparently contain copyrighted data (the impact factors).

It should be clear that I do not host this data on my website, and I do not own these two websites either. Thus I am not responsible for their content. However, I have still decided to censor the links as you can see above, although I probably do not have to since linking to a website is generally not considered copyright infridgment.

A problem for academia

So what does this tells us about Thomson Reuters? I think that they would expect every researcher to pay to access the impact factors. But not all universities have the funding to pay to access this data, which is very inconvenient for young researchers or people working in small universities or less developed countries. To avoid such problems, I think that as researchers, we should try to move away from impact factors provided by a company.  An impact factor is basically just calculated using some simple math formula using data provided by journals. There should be a way to have alternatives to these impact factors that are not calculated or owned by a company.  

The Streisand effect

Besides, can they really expect to censor links to these two websites all over the Internet? In the past, many people have failed to do this, and this has backfired. For example, there is the well known Streisand effect where the fact of trying to hide something makes it even more popular.

This is all I wanted to write for today. Just wanted to share this story.

By the way, all the names mentioned in this blog posts belong to the respective companies.

Related posts:

Posted in Academia, General | Tagged , , , | Leave a comment

How to publish in top conferences/journals? (Part 2) – The opportunity cost of research

Many researchers wish to produce high quality papers and have a great research impact. But how? In a previous blog post, I have discussed how the “blue ocean strategy” can be applied to publish in top conference/journal. In this blog post, I will discuss another important concept for producing high impact research, which is to consider the opportunity cost of research.

Opportunity cost in research

Opportunity cost

The concept of opportunity cost is widely used in the field of economy. I will explain this concept and then explain how it can be applied to research. Consider a situation where you must choose between several mutually exclusive choices C1, C2, … Cn.  If you choose a choice Cx, then you cannot choose the other choices because they are mutually exclusive with Cx.  Thus, by making a choice Cx, you may be getting some benefits, but you may miss other benefits that could be obtained when making other choices. In other words, when we make a choice in a given situation, we not only get the rewards or benefits that goes with that choice but we also lose other benefits that we could have obtained if we had made other choices.  The opportunity cost for making a choice Cx is thus the lost of benefits caused by not making other alternative choices.

Applying the concept of opportunity cost in research

The concept of opportunity cost is simple but very useful in many domains, and considering it allows to take better decisions. When making a choice between multiple alternatives, we should not only think about the direct benefits of that choice but also at the missed opportunities from other alternative choices (the opportunity cost).

How does that applies to research?  Well, in research, a researcher has multiple resources (times, money, students). In terms of time, when a researcher decides to spend time on a given project, he may as a result not have time to work on alternative research projects. Thus, given the limited amount of time that a researcher has, he must carefully choose between multiple research opportunities to maximize his benefits.

For example, it is tempting for several young researchers to write several papers with simple an unoriginal ideas, just to publish papers as quickly as possible. This may seems like a good idea in the short term. However,  the hidden opportunity cost of doing so is that the time spent on writing these simple papers cannot be spent to work on better research ideas that take more time to develop but may result in a higher impact on the long term.

Thus, from the perspective of opportunity cost, a researcher should try to choose carefully the research projects that are promising and spend more time to develop these projects and make good papers out of it, rather than to try to write as many papers as possible.

From my experience, even the most simple conference papers requires at least 1 or 2 weeks of work that could be spend on better research projects. The fastest paper that I have written for a conference was done in about 1 week, several years ago. But still,  even if it only takes a week, one should not forget that additional time needs to be spent to travel, prepare a PowerPoint, and present the paper at a conference. Thus, totally, the cost in terms of time for a quick conference paper may still add up to two weeks or more. And then, as explained, the opportunity cost of writing a simple paper is that one may not have enough time to work on better or more promising research ideas.  Thus, in recent years, I have shifted my focus to target better conferences and journals and focus less on smaller conferences.  I write less papers but these papers are of higher quality and can have a greater impact.

Of course, many other factors must be considered to publish high quality papers such as the choice of a good research topic. But in this blog post, I wanted to highlight the concept of opportunity cost.


In this blog post, I have highlighted how opportunity cost is applicable to research in academia. I personally think that it is an important concept, as many researchers are tempted to publish as many papers as possible without focusing much on quality, and often without realizing that that time could be spent on better research ideas. But producing high impact research requires to spend a considerable amount of time. Thus, one should carefully chose to spend more time on promising projects, rather than focusing too much on quantity. Besides time, the concept of opportunity cost is also applicable to other kinds of resources such as funding and students.

If you like this blog, you can subscribe to the RSS Feed or my Twitter account ( to get notified about future blog posts

Related posts:

Posted in Academia, Data Mining, Research | Tagged , | 1 Comment

Plagiarism by K. Raghu Naga Dhareswararao, T. Kishore

After the recent case of plagiarism at the  Ilahia college of engineering that I have reported recently, where no actions was taken at all to punish the professors who have plagiarized my paper, I have found today that another of my papers has been plagiarized by some Indian researchers.

I will write a short blog post about this topic because I am a little bit tired of writing about people from India who are plagiarizing my papers, as it happens several times every year. There are some excellent researchers in India. In my opinion, the problem of plagiarism is mostly in smaller colleges.

The plagiarized paper is published in an unknown journal named IJSEAT (International Journal of Science Engineering and Advance Technology). The title of this journal is very broad covering all fields of science and technology. Usually, this kind of journal is set up just for money by choosing a title as general as possible to collect more papers.

The paper is named  “AN SUITABLE MINIMUM UTILITY THRESHOLD BY TRIAL AND ERROR IS A TEDIOUS PROCESS FOR USERS” and is written by K. Raghu Naga Dhareswararao, T. Kishore ( )


The title of the paper already shows that the authors are not good at English writing.  It should be  “A suitable” rather than “An suitable”.

Then, there is the abstract:

We address the above issues by proposing another system for top-k high utility itemset mining, where k is the coveted number of HUIs to be mined. Two sorts of effective calculations named TKU (mining Top-K Utility item sets) and TKO (mining Top-K utility item sets in One stage) are proposed for mining such item sets without the need to set min util. We give a basic correlation of the two calculations with exchanges on their preferences and constraints. Exact assessments on both genuine and engineered datasets demonstrate that the execution of the proposed calculations is near that of the ideal instance of cutting edge utility mining algorithms.

In this abstract, the authors basically claim that the have proposed the TKU and TKO algorithms presented in a TKDE paper that I have co-authored.  That paper is not cited. Thus it is a clear case of plagiarism, which is unacceptable.

What will happen?

As I usually do, I will first send an e-mail to the editors of that journal to ensure that it will be retracted. Then I will send an e-mail to the department of their university so that punishment might be given to these authors.  In the past, this has generally worked. All the journals that I have contacted have retracted the plagiarized papers. However, some colleges like the Ilahia college of engineering and Galgotias college have simply ignored the cases of plagiarism that have happened in their colleges, by taking no actions at all. For example, I have contacted the head of department and principal of the Ilahia college of engineering  multiple times over several months before receiving an answer, and still no punishment was given.  Thus, in these colleges, the professors who committed plagiarism are still working as if nothing happened, which is a shame. In any western country, a professor committing plagiarism would likely be fired.

Anyway, I just write this blog post so that people know about these cases of plagiarism.

Related posts:

Posted in Academia, Plagiarism, Research | Tagged , , | Leave a comment

The PAKDD 2017 conference (a brief report)

This week, I have attended the PAKDD 2017 conference in Jeju Island, South Korea, this week, from the 23 to 26th May.  PAKDD is the top data mining conference for the asia-pacific region. It is held every year in a different pacific-asian country. In this blog post, I will write a brief report about the conference.


Conference location

The conference was held in the city of Seogwipo on Jeju island, a beautiful island in South Korea, which is famous for tourism, especially in Asia. Here is a map of the location.

PAKDD 2017 location

In particular, the conference was held at the Seogwipo KAL hotel.

PAKDD 2017 hotel korea

The hotel was well-chosen. It is about 1 km from the city, beside the sea.

Conference proceedings

The proceedings of the PAKDD conference are published by Springer in the Lecture Notes in Artificial Intelligence series. This ensures a good visibility to the papers published in the proceedings, which are indexed in the main computer science indexes such as DBLP.

The proceedings were given on a USB drive (4 Gb) rather than as a book, as many other conferences have been doing in recent years. Personally, I like to have proceedings as books, but USB drives are probably more friendly for the environment.

PAKDD 2017 proceedings

In general the quality of the papers at PAKDD conferences is good. This year,  458 papers were submitted. Among those, 45 papers were accepted as long papers and 84 as short papers. Thus, the global acceptance rate was about 28%.

Below, I present various slides from the opening ceremony presentation, which provides information about the PAKDD conference this year.

  1. The number of papers per category (submitted / accepted) is shown below. It is interesting to see that a large amount of applications and social network papers have been rejected. And for the topic of sequential data,  only 1 paper out of 10 was accepted.

PAKDD papers per category

2) The number of accepted long and short papers at PAKDD forthe last six years is presented below.PAKDD accepted papers

3) The decision criteria for accepting a paper at PAKDD are shown below.PAKDD decision criteria

4) There was 283 persons who have registered for PAKDD this year.PAKDD registration

5) The acceptance rate of long and short papers at PAKDD during the last six years PAKDD acceptance rate

6) The number of submitted vs accepted papers by country this year.  We can observe that China has the largest number of papers accepted and submitted.

PAKDD per country


Day 1 – workshops and tutorials, reception

On the first day, the registration started at 8:00 AM.

PAKDD 2017 registration

It was then followed by various workshops and tutorials. I have attended a workshop about Biologically Inspired Data Mining, a popular topic, which covers the applications of algorithms such as neural networks, bee swarm optimization, genetic algorithms, and ant colony optimization, to solve data mining problems. Evolutionary algorithms are quite interesting as they can  find approximate solutions to data mining problems that are quite good solutions, while running much faster than traditional algorithms that  find an optimal solution.  There was also some tutorials that I did not attend on information retrieval, recommender systems and tensor analysis. Besides, there was workshops on security, business process management, and sensor data analytics.

In the evening, there was a reception, which was a good opportunity for discussing with other researchers.

Day 2 – main conference, opening ceremony

There was an opening ceremony, followed by a keynote by Sang Kyun Cha from the Seoul National University of Korea. The keynote was about a potential fourth industrial revolution that would occurs due to the growth of AI-based services and big data technologies. This would lead to a need for more skilled workers such as engineers or “data scientists”. The talk was interesting but personally I prefer talks that are a little bit more technical. After that, there was multiple sessions of paper presentations.

Besides technical sessions, I also discussed with some representatives from Nvidia who were promoting a new supercomputer specially designed for training deep learning neural networks. It is called NVIDIA DGX-1 and costs around 200,000 $ USD. According to the promotional material, this computer has a eight Tesla P100 GPUs, each with 16 GB of memory, a total of 28672 NVIDIAN CUDA cores, and two dual 20-core Intel Xeon E5-2698 v4 2.2 Ghz processors. But what is the most interesting is that this GPU based system is claimed to be 250 times faster than a conventional CPU-only server for deep learning. I saw that there is also a similar product by IBM called the IBM Minsky, also equipped with NVidia GPUs.  This is especially interesting for those working on deep learning related topics.

NVidia DGX1

NVidia DGX1

Day 3 – main conference, excursion, banquet

On the third day of the conference, there was a keynote speech by Rakesh Agrawal, a senior researcher, who is one of the founder of data mining.  The talk was about the usage of social data.. The main question addressed in this talk was whether social data from websites such as Twitter is garbage or it can be useful for businesses.  R. Agrawal presented a project that he carried a few years ago at Microsoft where he analyzed Twitter data to study the opinion of people about Microsoft products. He also described a work where he compared the results of the Bing and Google search engines, and the result obtained when searching using social data rather than traditional search engines. The conclusion was that social data is certainly useful. R. Agrawal also gave some advices that young researchers should try to choose good research topics that are useful and can have an impact rather than just focusing on publishing a paper as quickly as possible.

Rakesh Agrawal PAKDD

On the afternoon, there was an excursion to Seopjikoji beach,  Seongsan Sunrise Peak and Seongeup Folk Village.

PAKDD excursion PAKDD excursion 2

In the evening, there was a banquet at the Seogwipo KAL hotel, with a musical performance.

Day 4 – main conference, closing ceremony

On the fourth day, there was a keynote by Dacheng Tao, from the University of Sydney Australia about current challenges in artificial intelligence. It was followed by several technical sessions, a lunch and a closing ceremony.


The conference was quite interesting. I had the occasion to meet many interesting people from academia and also from industry (e.g. Microsoft, Yahoo, Adobe, Nvidia). PAKDD is not the largest conference in data mining but it is a quite good conference, especially for the asia-pacific region, and the quality of the papers is quite high.  It was announced that next year, the PAKDD 2018 conference will be held in Melbourne, Australia. I will certainly try to attend it.

By the way, I had previously written a report about the PAKDD 2014 conference in Taiwan. You may have a look also at that report if you are interested by PAKDD.

Hope you have enjoyed this post!

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Academia, Big data, Conference, Data Mining, Data science | Tagged , , , | 2 Comments

Plagiarism by Bhawna Mallick and Kriti Raj at Galgotias College of Engineering & Technology

Today, I found that a paper written by Bhawna Mallick Kriti Raj and Himani from India is plagiarizing my papers. Those persons are affiliated to the  Galgotias College of Engg & Tech., India formerly known as Uttar Pradesh  Technical University.

The paper is called “Weather prediction using CPT+ algorithm” and was published in the International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issue 5 May 2016, Page No. 16467-16470. ( ).

Bhawna Mallick Galgotias College of Engineering & Technology

Plagiarism by Bhawna Mallick Uttar at Galgotias College of Engineering & Technology

The paper is an obvious case of plagiarism as it copies several pages of my PAKDD 2015 paper about the CPT+ algorithm, published a year before ( ).

The authors of the plagiarized paper,  Bhawna Mallick et al., claims to propose the CPT+ algorithm in their paper, and infringes our copyright by copying several pages of the paper, with figures and text, which is unacceptable.

Who is Bhawna Mallick, Kriti Raj  et al.?

I have done a little search to find who these persons are. According to the paper, their e-mails addresses are:

  • Bhawna Mallick. Galgotias College of Engg & Technology, Greater Noida, UP, India
  • Kriti Raj. Galgotias College of Engg & Technology, Greater Noida, UP, India
  • Himani.  Galgotias College of Engg & Technology, Greater Noida, UP, India

The website of the Galgotias College of Engineering & Technologycan be found here:

Galgotias college plagiarism

Galgotias college plagiarism

And it can be found that Bhawna Mallick et al are affiliated to the Department of Computer Science and Engineering.  In particular,  it is possible to find the following information about Bhawna Mallick the first author of the plagiarized paper:

Prof. (Dr.) Bhawna Mallick Professor & HOD Data Mining and Fuzzy Logic

Apparently, Bhawna Mallick is professor and head of the department.  Here is a screenshot of her webpage:

Bhawna Mallick Uttar Pradesh Plagiarism

Bhawna Mallick (Galgotias College of Engg & Technology) homepage

The fact that this person is head of that department and put her name as first author on a plagiarized paper raises serious questions about the quality of that college.

Also it raises questions about the quality of that journal.

And who is Kriti Raj?

According to this LinkedIn page, he is a student at the Galgotias College:

Kriti Raj Galgotia College Plagiarism

Kriti Raj Galgotia College Plagiarism


What will happen?

Well, this is not the first time that someone plagiarizes my papers. I stopped counting a while ago. I think that it has happened more than 10 times already in the last 7 years. So what I will do? I will send an e-mail to the Galgotias College of Engg & Technology to let them know about the situation, and I will ask them to take appropriate action. Moreover, I will ask that the journal paper be retracted  by contacting the journal editor.

Hopefully, the Galgotias College of Engg & Technology will take this problem of plagiarism seriously and take appropriate action to punish those responsible.  In India, this is not always the case. In another case of plagiarism that I had reported just a few months ago at the Ilahia College of Engineering and Technology, they have decided to simply ignore the e-mails and take no actions, which is very bad from an academic perspective.

Related posts:

Posted in Plagiarism | Tagged | 3 Comments

How to publish in top conferences/journals? (Part 1) – The Blue Ocean Strategy

A question that many young researchers ask is how to get your papers published in top conferences and journal.  There are many answers to this question. In this blog post, I will discuss a strategy for carrying research called the “Blue Ocean Strategy”.  This strategy was initially proposed in the field of marketing. But in this blog post, I will explain how it is also highly relevant in Academia.

The Blue Ocean Strategy was proposed in a 2007 book by Kim, C. W. and Mauborgne, R. The idea is quite simple. Let’s say that you want to start a new business and get rich. To start your business, you need to choose a market where your business will operate. Let’s say that you decide to start selling pens.  However, there are already a lot of pen manufacturers that are well-established and thus this market is extremely competitive and profit margins are very low. Thus, it might be very difficult to become successful in this market if you just try to produce pens like every other manufacturers. It is like jumping in a shark tank!

The Blue Ocean Strategy indicates that rather than fighting for some existing market, it is better to create some new markets (what is called a “blue ocean“). By creating a new market, the competition becomes irrelevant and you may easily get many new customers rather than fighting for a small part of an existing market. Thus, instead of trying to compete with some very well established manufacturer in a very competitive market (a “red ocean“), it is more worthy to start a new market (a “blue ocean”). This could be for example, a new type of pens that has some additional features.

Now let me explain how this strategy is relevant for academia.

In general,  there are two main types of research projects:

  • a researcher try to provide a solution to an existing research problem,
  • the researcher works on a new research problem.

The first case can be seen as a red ocean, since many researchers may be already working on that existing problem and it may be hard to publish something better. The second case is a blue ocean, since the researcher is the first one to work on a new topic. In that case, it can be easy to publish something since you do not need to do something better than other people, since you are the first one on that topic.

For example, I work in the field of data mining. In this field, many researchers work on publishing faster or more memory efficient algorithms for existing data mining problems. Although this research is needed, it can be viewed in some cases as lacking originality, and it can be very competitive to publish a faster algorithm.  On the other hand, if researchers instead work on proposing some new problems, then the research appears more original, and it becomes much more easy to publish an algorithm as it does not need to be more efficient than some previous algorithms.  Besides, from my observation, top conferences/journal often prefer papers on new problems to incremental work on existing problems.

Thus, it is not only easier to provide a solution to new research problem, but top conferences in some fields at least put a lot of value on papers that address new research problems. Thus, why fighting to be the best on an existing research problem?

Of course, there are some exceptions to this idea. If a researcher succeeds to publish some exceptional paper in a red ocean (on an existing research problem), his impact may actually be greater. This is especially if the research problem is very popular. But the point is that publishing in a red ocean may be harder than in a blue ocean.  And of course, not all blue oceans are equal. It is thus also important to find some good idea for new research topics (good blue oceans).

Personally, for these reasons, I generally try to work on “blue ocean” research projects.


In this blog post, I have discussed how the “Blue Ocean Strategy” and how it can be applied in academia to help in publishing in top conferences/journals. Of course, there are also a lot of other things to consider to write a good paper. You can read the follow up blog post on this topic here, where the opportunity cost of research is discussed.

If you like this blog and want to support it, please share it on social networks (Twitter, LinkedIn, etc.), write some comments, and continue reading other articles on this blog. 🙂

Related posts:

Posted in Academia, General, Research | Tagged , , , | 1 Comment

This is why you should visualize your data!

In the data science and data mining communities, several practitioners are applying various algorithms on data, without attempting to visualize the data.  This is a big mistake because sometimes, visualizing the data greatly helps to understand the data. Some phenomena are obvious when visualizing the data. In this blog post, I will give a few examples to convince you that visualization can greatly help to understand data.

An example of why using statistical measures may not be enough

The first example that I will give is a the Francis Anscombe Quartet.  It is a set of four datasets consisting of X, Y points. These four datasets are defined as follows:

Dataset I

Dataset II

Datset III

Dataset IV

































































































To get a feel of the data, the first thing that many  would do is to calculate some statistical measures such as the mean, average, variance, and standard deviation.  This allows to measure the central tendency of data and its dispersion. If we do this for the four above datasets, we obtain:

Dataset 1:   mean of X = 9, variance of X= 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 2:   mean of X = 9, variance of X= 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 3:   mean of X = 9, variance of X= 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 4:   mean of X = 9, variance of X= 11, mean of Y = 7.5, variance of Y = 4.125

So these datasets appears quite similar. They have exactly the same values for all the above statistical measures.  How about calculating the correlation between X and Y for each dataset to see how the points are correlated?

Dataset 1:   correlation 0.816
Dataset 2:  correlation 0.816
Dataset 3:  correlation 0.816
Dataset 4:  correlation 0.816

Ok, so these datasets are very similar, isn’t it?  Let’s try something else. Let’s calculate the regression line of each dataset (this means to calculate the linear equation that would best fit the data points).

Dataset 1:  y = 3.00 + 0.500x
Dataset 2:  y = 3.00 + 0.500x
Dataset 3:  y = 3.00 + 0.500x
Dataset 4:  y = 3.00 + 0.500x

Again the same!  Should we stop here and conclude that these datasets are the same?

This would be a big mistake because actually, these four datasets are quite different! If we visualize these four datasets with a scatter plot, we obtain the following:

Francis Anscombe Quartet

Visualization of the four datasets (credit: Wikipedia CC BY-SA 3.0)

This shows that these datasets are actually quite different. The lesson from this example is that by visualizing the data, difference sometimes becomes quite obvious.

Visualizing the relationship between two attributes

Simple visualizations techniques like scatter plots are also very useful for quickly analyzing the relationship between pairs of attributes in a dataset. For example, by looking at the two following scatter plots, we can quickly see that the first one show a positive correlation between the X and Y axis (when values on the X axis are greater, values on the Y axis are generally also greater), while the second one shows a negative correlation (when values on the X axis are greater, values on the Y axis are generally also smaller).

(a) positive correlation  (b) negative correlation (Credit: Data Mining Concepts and Techniques, Han & Kamber)

If we plot two attributes on the X and Y axis of a scatter plot and there is not correlation between the attributes, it may result in something similar to the following figures:

No correlation between the X and Y axis (Credit: Data Mining Concepts and Techniques, Han & Kamber)

These examples again show that visualizing data can help to quickly understand the data.

Visualizing outliers 

Visualization techniques can also be used to quickly identify outliers in the data. For example in the following chart, the data point on top can be quickly identified as an outlier (an abnormal value).

outlier scatter plot

Identifying outliers using a scatter plot

Visualizing clusters

In data mining, several clustering algorithms have been proposed to identify clusters of similar values in the data. These clusters can also often be discovered visually for low-dimensional data. For example, in the following data, it is quite apparent that there are two main clusters (groups of similar values), without applying any algorithms.

Two clusters

Data containing two obvious clusters


In this blog post, I have shown a few simple examples of how visualization can help to quickly see patterns in the data without actually applying any fancy models or performing calculations. I have also shown that statistical measures can actually be quite misleading if no visualization is done, with the classic example of the Francis Anscombe Quartet.

In this blog post, the examples are mostly done using scatter plots with 2 attributes at a time, to keep things simple. But there exists many other types of visualizations.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Big data, Data Mining, Data science | Tagged , , , , | 2 Comments

An Introduction to Sequential Pattern Mining

In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with a wide range of applications from text analysis to market basket analysis.  This blog post is aimed to be a short introductino. If you want to read a more detailed introduction to sequential pattern mining, you can read a survey paper  that I recently wrote on this topic.

What is sequential pattern mining?

Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in databases  Various types of patterns can be discovered in databases such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of sequential pattern mining is a data mining task specialized for analyzing sequential data, to discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and  webpage click-stream analysis.

I will now explain the task of sequential pattern mining with an example. Consider the following sequence database, representing the purchases made by customers in a retail store.

sequence database

This database contains four sequences.  Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.  

Traditionally, sequential pattern mining is being used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the frequent sequential patterns. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers to take marketing decisions.

To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the minimum support threshold. This parameter indicates a minimum number of sequences in which a pattern must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences appearing in at least 2 sequences of the input database.  In the example database, 29  subsequences met this requirement. These sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the support) is indicated in the right column of the table.sequential patterns

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appears in 3 and 2 sequences of the input database, respectively.  The pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3.   These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied on database containing hundreds of thousands of sequences.

Another example of application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

Can sequential pattern mining be applied to time series?

Besides sequences, sequential pattern mining can also be applied to time series (e.g. stock data), when discretization is performed as a pre-processing step.  For example, the figure below shows a time series  (an ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the same data, after applying a transformation.   Various transformations can be done to transform a time series to a sequence such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.

sequences and time series

Where can I get Sequential pattern mining implementations?

To try sequential pattern mining with your datasets, you may try the open-source SPMF data mining software, which provides implementations of numerous sequential pattern mining algorithms:

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering maximal sequential patterns, closed sequential patterns and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

What are the current best algorithms for sequential pattern mining?

There exists several sequential pattern mining algorithms. Some of the classic algorithms for this problem are PrefixSpan, Spade, SPAM, and GSP. However, in the recent decade, several novel  and more efficient algorithms have been proposed such as CM-SPADE  and CM-SPAM (2014), FCloSM and FGenSM (2017), to name a few.  Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).


In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for analyzing sequential data.  If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms forf sequential pattern mining, extensions,  research challenges and opportunities.

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Big data, Data Mining, Data science | Tagged , , , , , , , | 7 Comments

An Introduction to Data Mining

In this blog post, I will introduce the topic of data mining. The goal is to give a general overview of what is data mining.

what is data mining

What is data mining?

Data mining is a field of research that has emerged in the 1990s, and is very popular today, sometimes under different names such as “big data” and “data science“, which have a similar meaning. To give a short definition of data mining,  it can be defined as a set of techniques for automatically analyzing data to discover interesting knowledge or pasterns in the data.

The reasons why data mining has become popular is that storing data electronically has become very cheap and that transferring data can now be done very quickly thanks to the fast computer networks that we have today. Thus, many organizations now have huge amounts of data stored in databases, that needs to be analyzed.

Having a lot of data in databases is great. However, to really benefit from this data, it is necessary to analyze the data to understand it. Having data that we cannot understand or draw meaningful conclusions from it is useless. So how to analyze the data stored in large databases?  Traditionally, data has been analyzed by hand to discover interesting knowledge. However, this is time-consuming, prone to error, doing this may miss some important information, and  it is just not realistic to do this on large databases.  To address this problem, automatic techniques have been designed to analyze data and extract interesting patterns, trends or other useful information. This is the purpose of data mining.

In general, data mining techniques are designed either to explain or understand the past (e.g. why a plane has crashed) or predict the future (e.g. predict if there will be an earthquake tomorrow at a given location).

Data mining techniques are used to take decisions based on facts rather than intuition.

What is the process for analyzing data?

To perform data mining, a process consisting of seven steps is usually followed. This process is often called the “Knowledge Discovery in Database” (KDD) process.

  1. Data cleaning: This step consists of cleaning the data by removing noise or other inconsistencies that could be a problem for analyzing the data.
  2. Data integration: This step consists of integrating data  from various sources to prepare the data that needs to be analyzed. For example, if the data is stored in multiple databases or file, it may be necessary to integrate the data into a single file or database to analyze it.
  3. Data selection: This step consists of selecting the relevant data for the analysis to be performed.
  4. Data transformation: This step consists of transforming the data to a proper format that can be analyzed using data mining techniques. For example, some data mining techniques require that all numerical values are normalized.
  5. Data mining:  This step consists of applying some data mining techniques (algorithms) to analyze the data and discover interesting patterns or extract interesting knowledge from this data.
  6. Evaluating the knowledge that has been discovered: This step consists of evaluating the knowledge that has been extracted from the data. This can be done in terms of objective and/or subjective measures.
  7. Visualization:  Finally, the last step is to visualize the knowledge that has been extracted from the data.

Of course, there can be variations of the above process. For example, some data mining software are interactive and some of these steps may be performed several times or concurrently.

What are the applications of data mining?

There is a wide range of data mining techniques (algorithms), which can be applied in all kinds of domains where data has to be analyzed. Some example of data mining applications are:

  • fraud detection,
  • stock market price prediction,
  • analyzing the behavior of customers in terms of what they buy

In general data mining techniques are chosen based on:

  • the type of data to be analyzed,
  • the type of knowledge or patterns to be extracted from the data,
  • how the knowledge will be used.

What are the relationships between data mining and other research fields?

Actually, data mining is an interdisciplinary field of research partly overlapping with several other fields such as: database systems, algorithmic, computer science, machine learning, data visualization, image and signal processing and statistics.

There are some differences between data mining and statistics although both are related and share many concepts.  Traditionally, descriptive statistics has been more focused on describing the data using measures, while inferential statistics has put more emphasis on hypothesis testing to draw significant conclusion from the data or create models. On the other hand, data mining is often more focused on the end result rather than statistical significance. Several data mining techniques do not really care about statistical tests or significance, as long as some measures such as profitability, accuracy have good values.  Another difference is that data mining is mostly interested by automatic analysis of the data, and often by technologies that can scales to large amount of data. Data mining techniques are sometimes called “statistical learning” by statisticians.  Thus, these topics are quite close.

What are the main data mining software?

To perform data mining, there are many  software programs available. Some of them are general purpose tools offering many algorithms of different kinds, while other are more specialized. Also, some software programs are commercial, while other are open-source.

I am personally, the founder of the SPMF open-source data mining library, which is free and open-source, and specialized in discovering patterns in data. But there are many other popular software such as Weka, Knime, RapidMiner, and the R language, to name a few.

Data mining techniques can be applied to various types of data

Data mining software are typically designed to be applied on various types of data. Below, I give a brief overview of various types of data typically encountered, and which can be analyzed using data mining techniques.

  • Relational databases:  This is the typical type of databases found in organizations and companies. The data is organized in tables. While, traditional languages for querying databases such as SQL allow to quickly find information in databases, data mining allow to find more complex patterns in data such as trends, anomalies and association between values.
  • Customer transaction databases: This is another very common type of data, found in retail stores. It consists of transactions made by customers. For example, a transaction could be that a customer has bought bread and milk with some oranges on a given day. Analyzing this data is very useful to understand customer behavior and adapt marketing or sale strategies.
  • Temporal data: Another popular type of data is temporal data, that is data where the time dimension is considered. A sequence is an ordered list of symbols. Sequences are found in many domains, e.g. a sequence of webpages visited by some person, a sequence of proteins in bioinformatics or sequences of products bought by customers.  Another popular type of temporal data is time series. A time series is an ordered list of numerical values such as stock-market prices.
  •  Spatial data: Spatial data can also be analyzed. This include for example forestry data, ecological data,  data about infrastructures such as roads and the water distribution system.
  • Spatio-temporal data: This is data that has both a spatial and a temporal dimension. For example, this can be meteorological data, data about crowd movements or the migration of birds.
  • Text data: Text data is widely studied in the field of data mining. Some of the main challenges is that text data is generally unstructured. Text documents often do no have a clear structure, or are not organized in predefined manner. Some example of applications to text data are (1) sentiment analysis, and  (2) authorship attribution (guessing who is the author of an anonymous text).
  • Web data: This is data from websites. It is basically a set of documents (webpages) with links, thus forming a graph. Some examples of data mining tasks on web data are: (1) predicting the next webpage that someone will visit, (2) automatically grouping webpages by topics into categories, and (3) analyzing the time spent on webpages.
  • Graph data: Another common type of data is graphs. It is found for example in social networks (e.g. graph of friends) and chemistry (e.g. chemical molecules).
  • Heterogeneous data. This is some data that combines several types of data, that may be stored in different format.
  • Data streams: A data stream is a high-speed and non-stop stream of data that is potentially infinite (e.g. satellite data, video cameras, environmental data).  The main challenge with data stream is that the data cannot be stored on a computer and must thus be analyzed in real-time using appropriate techniques. Some typical data mining tasks on streams are to detect changes and trends.

What types of patterns can be found in data?

As previously discussed, the goal of data mining is to extract interesting patterns from data. The main types of patterns that can be extracted from data are the following (of course, this is not an exhaustive list):

  • Clusters: Clustering algorithms are often applied to automatically group similar instances or objects in clusters (groups).  The goal is to summarize the data to better understand the data or take decision. For example, clustering techniques can be used to automatically groups customers having a similar behavior.
  • Classification models: Classification algorithms aims at extracting models that can be used to classify new instances or objects into several categories (classes). For example, classification algorithms such as Naive Bayes, neural networks and decision trees can be used to build models that can predict if a customer will pay back his debt or not, or predict if a student will pass or fail a course.
  • Patterns and associations: Several techniques are developed to extract frequent patterns or associations between values in database. For example, frequent itemset mining algorithms can be applied to discover what are the products frequently purchased together by customers of a retail store.
  •  Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies). Some applications are for example: (1) detecting hackers attacking a computer system, (2) identifying potential terrorists based on suspicious behavior, and (3) detecting fraud on the stock market.
  • Trends, regularities:  Techniques can also be applied to find trends and regularities in data.  Some applications are for example to (1) study patterns in the stock-market to predict stock prices and take investment decisions, (2) discovering regularities to predict earthquake aftershocks, (3) find cycles in the behavior of a system, (4) discover the sequence of events that lead to a system failure

In general, the goal of data mining is to find interesting patterns. As previously mentioned, what is interesting can be measured either in terms of objective or subjective measures. An objective measure is for example the occurrence frequency of a pattern (whether it appears often or not), while a subjective measure is whether a given pattern is interesting for a specific person. In general, a pattern could be said to be interesting if: (1) it easy to understand, (2) it is valid for new data (not just for previous data); (3) it is useful, (4) it is novel or unexpected (it is not something that we know already).


In this blog post, I have given a broad overview of what is data mining. This blog post was quite general. I have actually written it because I am teaching a course on data mining and this will be some of the content of the first lecture. If you have enjoyed reading, you may subscribe to my Twitter feed (@philfv) to get notified about future blog posts.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Data Mining, Data science, General | 3 Comments

Write more papers or write better papers? (quantity vs quality)

In this blog post, I will discuss an important question for young researchers, which is: Is it better to try to write more papers  or to try to write fewer but better papers?  In other words, what is more important: quantity or quality in research?

To answer this question, I will first explain why quantity and quality are important, and then I will argue that a good trade-off needs to be found.


There are several reasons why quantity is important:

  • Quantity shows that someone is productive and can have a consistent research output. For example, if someone has published 4 papers each year during the last four years, it approximately shows what can be expected from that researcher in terms of productivity for each year. However, if a researcher has an irregular research output such as zero papers during a few years, it may raise questions about the reasons why that researcher did not write papers. Thus writing more show to other people that you are more active.
  • Quantity is correlated with research impact.  Even though, writing more papers does not means that the papers are better, some studies have shown a strong correlation between the number of papers and the influence of researchers in their field. Some of reasons may be that (1) writing more papers improve your visibility in your field and your chances of being cited, (2) if you are more successful, you may obtain more resources such as grants and funding, which help you to write more papers, and (3) writing more may improve your writing skills and help you to write more and better papers.
  • Quantity is used to calculate various metrics to evaluate the performance of researchers.  In various countries and institutions, metrics are used to evaluate the research performance of researchers. These metrics include for example: the number of papers and the number of citations. Although metrics are imperfect, they are often used for evaluating researchers because they allow to quickly evaluate a researcher without reading each of his publications.  Metrics such as the number of citations are also used on some website such as Google Scholar to rank articles.


The quality of papers is important for several reasons:

  • Quality shows that you can do excellent research. It is often very hard to publish in top level journals or conferences. For example, some conferences have an acceptance rate of 5 % or even less, which means that out of 1000 submitted papers, only 50 are accepted.  If you can get some papers in top journals and conferences, it shows that you are among the best researchers in your field. On the contrary, if someone only publish papers in weak and unknown journals and conferences, it will raise doubts about the quality of the research, and about his ability at doing research. Publishing in some unknown conference/journals can be seen as something negative that may even decrease the value of a CV.
  • Quality is also correlated with research impact. A paper that is published in a top conference or journal has more visibility and therefore has more chance of being cited by other researchers. On the contrary, papers published in small or unknown conferences have more chance of not being cited by other researchers.

A trade-off

So what is the best approach?  In my opinion, both quantity and quality are important. It is especially important to write several papers for young researchers to kickstart their career and fill their CV to apply for grants and obtain their diplomas. But having  some quality papers is also necessary .  Having a few good papers in top journals and conferences can be worth much more than having many papers in weak conferences.  For example, in my field, having a paper in a conference like KDD or ICDM could be worth more than 5 or 10 papers in smaller conferences.  But the danger of putting too much emphasis on quality is that the research output may become very low if the papers are not accepted.  Thus, I believe that the best approach is to use a trade-off: (1) once in a while write some very high quality papers and try to get them published in top journals and conferences, (2) but sometimes write papers for easier journals and conference to increase the overall productivity, and get some papers published.

Actually, a researcher should be able to evaluate whether a given research project is suitable for a high level conference/journal or not based on the merit of the research, and whether the research needs to be published quickly (for very competitive topics). Thus, a researcher should decide for each paper whether it should be submitted to a high level conference/journal or something easier.

But, there should always be a minimum quality requirement for papers. Publishing bad papers or publishing very weak papers can have a negative influence on your CV and even look bad. Thus, even when considering quantity, one should ensure that a minimum quality requirement is met. For example, since my early days as researchers, I have set a minimum quality requirements that all my papers be at least published by a well-known publisher among ACM, IEEE, Springer, Elsevier, and be indexed in DBLP (an index for computer science). For me, this is the minimum quality requirement but I will often aim at good or excellent confernce/journal depending on the projects.

Hope that you have enjoyed this post. If you like it, you can continue reading this blog, and subscribe to my Twitter ( @philfv ).

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Academia, General, Research | Leave a comment