An introduction to frequent subgraph mining

In this blog post, I will give an introduction to an interesting data mining task called frequent subgraph mining, which consists of discovering interesting patterns in graphs. This task is important since data is naturally represented as graph in many domains (e.g. social networks, chemical molecules, map of roads in a country). It is thus desirable to analyze graph data to discover interesting, unexpected, and useful patterns, that can be used to understand the data or take decisions.

What is a graph? A bit of theory…

But before discussing the analysis of graphs, I will introduce a few definitions.  A graph is a set of vertices and edges, having some labels. Let’s me illustrate this idea with an example. Consider the following graph:

This graph contains four vertices (depicted as yellow circles). These vertices have labels such as “10” and “11”.  These labels provide information about the vertices. For example, imagine that this graph is a  chemical molecule. The label 10 and 11 could represent the chemical elements of Hydrogen and Oxygen, respectively. Labels do not need to be unique. In other words, the same labels may be used to describe several vertices in the same graph. For example, if the above graph represents a chemical molecule, the labels “10” and “11” could be used for all vertices representing Oxygen and Hydrogen, respectively.

Now, besides vertices, a graph also contains edges. The edges are the lines between the vertices, here represented by thick black lines.  Edges also have some labels. In this example, four labels are used, which are 20, 21, 22 and 23.  These labels represents different types of relationships between vertices. Edge labels do not need to be unique.

Types of graphs: connected and disconnected 

Many different types of graphs can be found in real-life. Graphs are either connected or disconnected.  Let me explain this with an example. Consider the two following graphs:

Connected and disconnected graphs

The graph on the left is said to be a connected graph because by following the edges, it is possible to go from any vertex to any other vertices. For example, imagine that vertices represents cities and that the edges are roads between cities. A connected graph in this context is a graph where it is possible to go from any city to any other cities by following the roads.  If a graph is not connected, it is said to be a disconnected graph. For example, the graph on the right is disconnected since Vertex A cannot be reached from the other vertices by following the edges. In the following, we will use the term “graph” to refer to connected graphs.  Thus, all the graphs that we will discuss in the following paragraphs will be connected graphs.

Types of graphs: directed and undirected

It is also useful to distinguish between directed and undirected graphs. In an undirected graph, edges are bidirectional, while in a directed graph, the edges can be unidirectional or bidirectional. Let’s illustrate this idea with an example.

directed and undirected graphs

The graph on the left is undirected, while the graph on the right is directed. What are some real-life examples of a directed graph?  For example, consider a graph where vertices are locations and edges are roads. Some roads can be travelled in both directions while some roads may be travelled in only a single direction  (“one-way” roads in a city).

Some data mining algorithms are designed to work only with undirected graphs, directed graphs, or support both.

Analyzing graphs

Now that we have introduced a bit of theory about graphs, what kind of data mining task can we do to analyze graphs?  There are many  answers to this question. The answer depends on what is our goal but also on the type of graph that we are analyzing (directed/undirected, connected/disconnected,  a single graph or many graphs).

In this blog post, I will explain a popular task called frequent subgraph mining. The goal of subgraph mining is to discover interesting subgraph(s) appearing in a set of graphs (a graph database). But how can we judge if a subgraph is interesting?  This depends on the application. The interestingness can be defined in various ways. Traditionally, a subgraph has been considered as interesting if it appears multiple times in a set of graphs. In other words, we want to discover subgraphs that are common to multiple graphs. This can be useful for example to find association between chemical elements common to several chemical molecules.

The task of finding frequent subgraphs in a set of graphs is called  frequent subgraph mining.  As input the user must provide:

  • a graph database (a set of graphs)
  • a parameter called the minimum support threshold (minsup).

Then, a frequent subgraph mining algorithm will enumerate as output all frequent subgraphs. A frequent subgraph is a subgraph that appears in at least minsup graphs from a graph database.  For example, let’s consider the following graph database containing three graphs:

A graph database

Now, let’s say that we want to discover all subgraphs that appear in at least three graphs. Thus, we will set the minsup parameter to 3. By applying a frequent subgraph mining algorithm, we will obtain the set of all subgraphs appearing in at least three graphs:

frequent subgraphs

Consider the third subgraph (“Frequent subgraph 3”).  This subgraph is frequent and is said to have a support (a frequency) of 3 since it appears in three of the input graphs. These occurrences are highlighted in red, below:

Frequent subgraph 3

Now a good question is how to set the minsup parameter? In practice, the minsup parameter is generally set by trial and error.  If this parameter is set too high, few subgraphs will be found, while if it is set too low, hundred or millions of subgraphs may be found, depending on the input database.

Now, in practice, which tools or algorithms can be used to find frequent subgraphs? There exists various frequent subgraph mining algorithms. Some of the most famous are GASTON, FSG, and GSPAN.

Mining frequent subgraphs in a single graph

Besides discovering graphs common to several graphs, there is also a variation of the problem of frequent subgraph mining that consists of finding all frequent subgraphs in a single graph rather than in a graph database. The idea is almost the same. The goal is also to discover subgraphs that appear frequently or that are interesting. The only difference is how the support (frequency) is calculated. For this variation, the support of a subgraph is the number of times that it appears in the single input graph. For example, consider the following input graph:

A single graph

This graph contains seven vertices and six edges. If we perform frequent subgraph mining on this single graph by setting the minsup parameter to 2, we can discover the five following frequent subgraphs:

frequent subgraphs in a graph

These subgraphs are said to be frequent because they appear at least twice in the input graph. For example, consider “Frequent subgraph 5”. This subgraph has a support of 2 because it has two occurrences in the input graph. Those two occurrences are highlighted below in red and blue, respectively.

the frequent subgraph 5

Algorithms to discover patterns in a graph database can often be adapted to discover patterns in a single graph.


In this blog post, I have introduced the problem of frequent subgraph mining, which consists of discovering subgraphs appearing frequently in a set of graphs.  This data mining problem has been studied for more than 15 years, and many algorithms have been proposed.  Some algorithms are exact algorithms (will find the correct answer), while some other are approximate algorithms (do not guarantee to find the correct answer, but may be faster).

Some algorithms are also designed to handle directed or undirected graphs, or mine subgraphs in a single graph or in a graph database, or can do both. Besides, there exists several other variations of the subgraph mining problem such as discovering frequent paths in a graph, or frequent trees in a graph.

Besides, in data mining in general, many other problems are studied related to graphs such as optimization problems, detecting communities in social networks, relational classification, etc.

In general, problems related to graphs are quite complex compared to some other types of data. One of the reason why subgraph mining is difficult is that algorithms typically need to check for “subgraph isomorphisms“, that is to compare subgraphs to determine if they are equivalent. But nonetheless, I think that these problems are quite interesting as there are several research challenges.

I hope that you have enjoyed this blog post. If there is some interest about this topic, I may do another blog post on graph mining in the future.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Big data, Data Mining, Data science | Tagged , , , , , , | 9 Comments

We are launching a new data mining journal

In this blog post, I will discuss one of my recent and current project. I have been recently working with my colleague Chun-Wei Lin on launching a new journal, titled “Data Science and Pattern Recognition“.

Data Science and Pattern Recognition

This is a new open-access journal, with a focus on data science, big data, and pattern recognition. The journal will be published by the Taiwanese publisher Ubiquitous International, which already publishes several SCI journals. In the next year, we will be pushing hard to make this journal a high level journal. For this purpose, we have worked hard on creating an outstanding editorial board with world-class researchers in the fields of data science and pattern recognition.

It is our goal to make the journal EI and SCI indexed in the next years.  Currently, there is no publication fee for publishing in this journal, to help support the journal initially. But there should be some in the future, since it is an open-access journal. Thus, it is a good time to submit your papers, now!

There will be 4 issues per year, for a total of approximately 24 papers per year. The first issue has just been published. It contains five papers including a paper that I wrote.

The next issue is planned for June, so if you want to submit a paper, please send it before June. After that, the following issues should be in September and December. In any case, if you have any question about the journal, you can have a look at the website, let me know directly, or use the contact e-mail on the journal website to contact us.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Big data, Data Mining, Data science, Research | Tagged , , , | 2 Comments

What is the job of a university professor?

In this blog post, I will discuss the job of university professor. And, I will discuss why I have chosen to become one. This post is especially aimed at students who are considering working in academia after their Ph.D.

A professor

What is the job of university professor?

When I was an undergraduate student, I did not know much about what the professors were really doing at my university beside teaching.  In general, there are three main tasks that a professor must do:

  • Teaching. This consists of teaching courses for students. This is a very important task since it is the reason for having professors in universities. The numbers of courses that a professor teaches every year can vary greatly depending on the university and the rank of the professor. Some university are known to put more emphasis on research, while other put more emphasis on teaching. When I was working as a professor in Canada, I first started with about 220 hours (5 courses) / year, which did not give me much free time for doing research. Then after receiving some funding and having several graduate students in my team, my teaching load was reduced to about 3 courses / year to let me do more research.  Now in China, in 2016, I have taught about 100 hours.
  • Research. The second task that a professor must do is to carry innovative research that lead to the publication of research papers. Why a professor is doing research? Some key reasons are to attract top talents in university, and to ensure that the professors stay active and keep their knowledge up to date. Now, what is the difference between the research that a professor does and the research that a graduate student do?  The main difference is that a professor is expected to carry a long-term research program and have several students carrying research in his team. It is a bit like going from the job of an employee to being a team leader or business manager, in the sense that the professor has to manage not only his own research projects but generally also the research of a team, and have a clear plan for the next years to come. Many of the tasks related to research that a professor does is related to the management of his team. For example, a professor typically has to apply for funding from the government, or find projects with companies to supply funds to his research team. This may involve writing numerous grant applications and attend many meetings.
  • Service to the community. The third task that a professor must do is to provide service to the community. This can be various activities at the university level, national level, or international level. For example, when I was a professor in Canada, I was involved in a programming competition for high schools, and recruiting students for our university. I also helped to organize a LAN party for students every year, and other activities with students. Another task that I was doing was to evaluate the applications of the master degree students applying to the university in our program. This latter task was consuming many hours of my time every week.  At a national and international level, my service to the community has included tasks such as reviewing articles for conferences and journals, being the editor of a journal, and the founder of an open-source data mining software.

Thus, to be a professor, one should enjoy doing these three tasks. If one only enjoys teaching but does not enjoy research, then perhaps that it is best to become a lecturer (a professor not carrying research – in the North-American system of education). Or if one only enjoy research but does not enjoy teaching, then it is probably best to become a researcher rather than a professor, and work in the industry. But there are some exceptions. Nonetheless, a professor should ideally enjoy doing all these tasks, and should do them well.

What are the advantages/disadvantages of being a professor?

As for any other jobs, there are advantages and disadvantages for choosing this job. Let’s analyze this:

  • Salary: Depending on the country and university, the salary of a university professor can be quite high. But it is generally lower than working as a researcher in the industry. Thus, the motivation should not be only about money.
  • Schedule: One of the greatest advantage and disadvantage of being a professor is the work schedule. A professor may be extremely busy. He may have to work during week-ends, evenings, and often more than 10 hours a day to keep up with research, teaching and other tasks. The first few years of being a professor can be really hard in terms of schedule. The reason is that a new professor typically has to teach, prepare many new courses and apply for funding, and setup his research team. This is quite different from the life of a Ph.D student who often can concentrate on only doing research. After a few years, the job of a professor becomes easier.  However, although the schedule of a professor is very busy, what I like about it is the freedom about how to organize my time. A professor may typically decide to work at home if he is not teaching and may decide to wake up late but finish working late a night.  This is different from working in a company.
  • Traveling: Another thing that I like about being a professor is the opportunity for traveling.  A professor typically has to attend international conferences to present his research and meet other researchers, and may also visit other universities for collaborations. This is interesting from the research perspective. But it is also interesting from a personal perspective. Of course, it depends on the funding. Not every professor has funding for attending international conferences and travelling.
  • Being your own boss: Another advantage of being a professor is that a professor generally has a lot of freedom for the topic of his research. A professor may decide to work on topics that he like rather than work on the projects of other people (as someone would typically do, when working in a company). I often think of being a professor as someone running a business. The professor must decide of the research directions and manage his team. I enjoy this freedom of being able to work on the topics that I like, and also to publish the result of my research freely as open-source programs that can benefit anyone.

What is required to become a university professor?

In terms of academic skills, good universities require to have a Ph.D. degree. But having a PhD. is often not enough. It is also generally required to have several publications in good conferences and journals.  Now, if we analyze this in more details, a professor should have the following skills:

  • teaching: A professor must be able to explain concepts in simple ways, clearly, and concisely so that students can learn efficiently.  A professor must also be able to make the classes enjoyable rather than boring, and prepare their courses well, assignments and exams.
  • Carrying research:  A professor must be good at doing research.  This includes being able to find interesting research questions and find innovation solutions to solve these problems.
  • writing: A professor must also be good at writing. This is important as a professor must write journal papers, conference papers, grant applications, and many other documents. Being good at writing is related to being good at teaching, since writing often requires to explain concepts in simple ways so that the reader can understand (just like teaching).
  • Managing a team/sociability: A professor should be able to manage a team and also ideally establish collaborations with other researchers. Thus some management and social skills are required.
  • Being good in his field: A professor must also be good in his field of study. For example, for the field of computer science, a professor should ideally also be a good programmer. This is different from being a good researcher, as a good programmer is not necessarily a good researcher, and a good researcher is not necessarily a good programmer.
  • Being a workaholic 😉 :  actually, not everybody can work or likes to work 10 hours or more every day. And also not every professor work very hard but to become a professor it still generally requires a huge amount of work. And the first years of being a professor can be quite hard. When I was an undergraduate student, I saw it as a challenge for me to see how far I would be able to go into academia, and since I like working, the amount of work did not put me away. I have worked very hard in the last 10 years . For example, during my studies, I typically just took a few days off during a whole year, and worked every day from the morning until the evening. For me, it is worth it.

It might seem like a long list of skills to have. But actually, all these skills can be developed over time. When I was an undergraduate students for example, I did not know how to write well. I remember that the first paper that I attempted to write by myself during my master degree was quite terrible. Over the years during my master’s degree and Ph.D., I have greatly improved my writing skills by practicing writing many papers. I also greatly improved my research skills, and teaching skills. In terms of teaching, it requires some practice and dedication to become a good teacher.


So why did I choose to become a professor? The short answer is that I really like research and the freedom of doing my own research. I also like to be able to manage my schedule, travelling, and I also enjoy teaching.  Would I move to the industry someday? No. Even if I could probably earn more in the industry, I am happy to do what I am doing in academia.

If you have enjoyed reading this blog post, you can continue reading this blog, which has many other posts related to academia and data mining.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Related posts:

Posted in Academia, General, Research | Tagged , , , | Leave a comment

Introduction to clustering: the K-Means algorithm (with Java code)

In this blog post, I will introduce the popular data mining task of clustering (also called cluster analysis).  I will explain what is the goal of clustering, and then introduce the popular K-Means algorithm with an example. Moreover, I will briefly explain how an open-source Java implementation of K-Means, offered in the SPMF data mining library can be used.

What is clustering?

Clustering is one of the most fundamental data mining tasks.

Consider that you have a database, containing several records (which we will call instances). For example, consider the following database describing several bank customers.

25      Canada          3000     M         50
42      Brazil          4000     F         500

In this example, each instance (customer) is described according to some attributes such as agenationalityincome, gender, and  account_balance. 

The goal of clustering is to automatically find groups of instances (e.g. customers) that are similar in a database. For example, for the database of bank customers, clustering the customers, consists of automatically grouping the customers that have a similar profile. For example, a clustering algorithm could perhaps find that many female customers who are rich have a similar profile or that many young Canadians with a low income have a similar profile (this is just an example).

Thus, the goal of clustering is to find groups of instances having similar characteristics in a database. These groups of instances found by clustering are called clusters. Thus, the goal of clustering is to find clusters of similar instances.

Let’s have a look at a second example before giving a more formal definition of what is clustering.  This time, we will consider a database of  2D points.  Consider the following database containing 31 two dimensional points.

Instance 1 (1,1) Instance 17 (16, 16)
Instance 2 (0,1) Instance 18 (11.5, 8)
Instance 3 (1,0) Instance 19 (13, 10)
Instance 4 (11,12) Instance 20 (12, 13)
Instance 5 (11, 13) Instance 21 (14, 12.5)
Instance 6 (13, 13) Instance 22 (14.5, 12.5)
Instance 7 (12, 8.5) Instance 23 (14.5, 11.5)
Instance 8 (13, 8) Instance 24 (15, 10.5)
Instance 9 (13, 9) Instance 25 (15, 9.5)
Instance 10 (13, 7) Instance 26 (12, 9.5)
Instance 11 (11, 7) Instance 27 (10.5, 11)
Instance 12 (8, 2) Instance 28 (10, 10.5)
Instance 13 (9, 2) Instance 29 (9, 3)
Instance 14 (10, 1) Instance 30 (9, 4)
Instance 15 (7, 13) Instance 31 (9,5)
Instance 16 (5, 9)

Each instance (point) in this database is described by two attributes:  the X and Y coordinates.  This database can be represented visually usually a XY chart:

A database of 2D points

It can be observed in these figures that some points are quite close, while some other points are quite far away. In this context, the goal of clustering is to find groups of points that are similar (close to each other).  By applying a clustering algorithm such as K-Means, three clusters could be found (represented by the blue, red and green colors):

Three clusters of points

Intuitively, these clusters somewhat make sense as they are groups of points that are close to each other.

Clustering has many applications. It can be used for example to cluster similar documents in categories (for example, automatically classifying news on a website into categories).

Clustering algorithms

To perform clustering, it is necessary to apply a clustering algorithm. There exists hundreds of clustering algorithms to find hidden clusters in data.  Different algorithms have different advantages and disadvantages and can be more or less suitable to different types of data.  Some of the most popular algorithms are for example K-Means and DBScan.

In the following paragraphs, I will briefly explain how the K-Means algorithms works.

The K-Means algorithm

The K-Means algorithm was proposed in 1967 by MacQueen.  This algorithm has two main parameters:  (1) a database,  (2)  a positive integer K representing the number of clusters to be extracted from the database.

The K-Means algorithm consists of the following steps:

(1) The algorithm reads the database in memory. The database contains several instances.

(2) The algorithm initialize K empty clusters. Each cluster has a prototype, which is a randomly generated instance.

(3)  Each instance in the database is assigned to the cluster  having the closest prototype.

(4) Then, the prototype of each cluster is recomputed as the average of all the instances in that cluster.

(5) Then, Step3 and Step 4 are repeated several times, until the clusters become stable.

Now let me illustrate these steps with an example so that it becomes clear. By applying K-Means on the database of 2D points, (1) K-Means will first load the database of 31 points in memory. Then, assume that the user has set K = 3 to generate 3 clusters. (2) K-Means will initially create three empty clusters which will have some random points as prototypes.  Let say that these three points (prototypes) are (2,2),  (16, 4) and (11, 0) .  We can represent this visually as follows (where the prototypes are represented by the X symbol):

Three cluster prototypes
Then (3) K-means will assign each of the 31 points to the cluster having the closest prototype. The result will be as follows:

Initial K-Means clusters

Then (4) the prototype of each cluster is recomputed as the average of the points that it contains. The new prototypes are approximately (1.75, 2.75), (9.1, 5.2) and (12.9, 10.7) for the three clusters of points. 

Updated K-Means clusters

Then (3), each point is assigned to the cluster having the closest prototype. The result is the clusters shown in the picture below. As you can see, some points have moved from one cluster to another.

New K-Means Clusters

Then, step (4) is applied again to recompute the prototype of each cluster as the average of its points. The new prototypes are (0.7, 0.7), (12.4,10.8) and  (8.7,4.1).

Then step (3) is applied again. K-Means assigns each point to the cluster having the closest prototype. Since the cluster did not change after this step, the K-Means algorithms stop and the final result is the following three clusters, here displayed with colors:

Final set of K-Means cluster

An open-source Java implementation of K-Means

If you want to try the K-Means algorithm with the above example by yourself,  a Java implementation of K-Means is provided in the SPMF library.

To run the example, you will need to have Java 1.8 or higher installed on your computer.

(1) The first step is to download the database of points, which can be obtained here:

(2) Then, you will need to download the release version of SPMF, which is a file called spmf.jar, and can be obtained from the download page of SPMF.

(3) Then, double-click on the file spmf.jar to launch SPMF. If it does not work, it means that there is a problem with the Java installation on your computer.  If it works correctly, the user interface of SPMF should open:

  • (A)  choose the KMeans algorithm in the combo box.
  • (B)  choose the input file  as  “inputDBScan2.txt”
  • (C) set the output file as “text.txt”. The output of the algorithm will be written to that file.
  • (D) Set the parameter K to 3
  • (E) To be able to calculate how similar the instances are close to each other, we need to specify a distance function. For this example, we should use the Euclidian distance, as it is the typical distance function for comparing 2D points.
  • (F) Select the “cluster viewer” to see the results as a chart.
  • (G) Click the “Run algorithm” button” to launch the K-Means algorithm.

The result will be displayed in a window, such as this:

Using K-Means in SPMF

Note that it is quite possible that the result that you will obtain is not exactly the same. The reason is that K-Means is a randomized algorithm. In other words, K-Means utilize random numbers. Thus, if K-Means is run several times, it may not always generate the same result.

Why is the K-Means algorithm popular?

Having presented the K-Means algorithm, let’s now briefly discuss its characteristics in more details. Why is K-Means popular?  The main reason is that it is a very simple algorithm. It is easy to implement and also easy to understand.

Some drawback of the K-Means algorithms is that the final result depends on how the initial prototypes are selected randomly. If we are lucky, we may find some nice clusters as in the above example, that somewhat make sense. However, if the initial points are not chosen in an appropriate way, the result may not be meaningful. There exists various solutions to this problem such as running K-Means several times, and then to choose the best set of clusters.

Another limitation of K-Means is that the user must explicitly specify the number of clusters to be found (the K parameter). But finding the best value for the K parameter may require to try several values.

To address these limitations various extensions of K-Means have been proposed. Besides, many other clustering algorithms have been proposed.


This blog post has given an overview of the task of clustering, a fundamental data mining task. The goal was to give a simple introduction. The post has explained how the K-Means algorithm works and has shown how to use the SPMF implementation of K-Means. In a future blog post, I will explain the DBScan clustering algorithm, which is also offered in SPMF.

For the interested reader, a good introduction to clustering can be found in the book “Introduction to data mining” by Tan, Steinbach and Kumar. The first chapter about clustering is freely available here.

Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Related posts:

Posted in Big data, Data Mining, Data science, Open-source | Tagged , , , , , , | 2 Comments

Happy New Year!

To all those reading this blog and/or using the SPMF library, I wish you a Merry Christmas and Happy new year!

Data mining open source christmas

Related posts:

Posted in General | Leave a comment

Introduction to time series mining with SPMF

This blog post briefly explain how time series data mining can be performed with the Java open-source data mining library SPMF (v.2.06).  It first explain what is a time series and then discuss how data mining can be performed on time series.

What is a time series? What is the difference with sequences?

There are two main types of sequential data considered in data mining:  sequences and time series.

A time-series is an ordered list of numbers. For example, the figure below shows a time-series. This time series could represent for example, temperature readings on different days, collected with sensors. Time series are also used to represent many other types of data such as stock market data.

A time series

A time series

On the other hand, a sequence is an ordered list of nominal values (symbols). For example, a sequence is shown below.

      a, b, c,  e, g, h, a, b, c, e

This sequence indicates that an item (symbol)  a, was followed by an item b, which was followed by an item c,  then, e, g, h, a, b, c, and e, in that order. Sequences have also many real-life applications. They can be used for example to represent sentences in texts (sequences of words), sequences of items purchased by customers in retail stores, and sequences of webpages visited by users.

How to analyze time series?

Both time series and sequences can be analyzed using data mining techniques to discover interesting patterns for understanding the data or decision-making. Since time series are a type of numeric data, and sequences are symbolic data, the traditional techniques for analyzing time series and sequences are quite different. However, it is possible to convert time series to sequences by discretizing the time series (transforming the numbers into symbols. Then techniques for analyzing sequences can also be applied to analyze the time series.

There exists several ways of transforming time-series to sequences. One of the most popular techniques is the SAX algorithm (Lin et al., 2007). But many other techniques exist and variations. The SPMF open-source data mining library discussed in this blog post, offers a fast Java implementation of the SAX algorithm.  Let’s see how it works.

Discretizing a time series

Consider the time series shown in the previous figure. It contains n=11 data points.  To apply the SAX algorithm, it is necessary to specify a number of symbols w and a number of data points v. Let’s say that the number of symbols is 4  and that the number of data point is 8.  The SAX algorithm will perform two steps.

First, it will convert the time series from 11 data points to 8 data points. To do this, it will split the time series into 8 segments and replace each segment by its average. This is called the piecewise aggregate approximation (PAA) of the time series. Visually, this gives the following result.

Transforming a time series to its PAA representation

Transforming a time series to its PAA representation

As it can be seen above, transforming a time series to its PAA representation is a way of reducing the dimensionality of a time series by reducing the number of data points.

Second, the SAX algorithm transforms the PAA representation of the time series to symbols. Each data point is replaced by a symbol. The number of symbols is selected by the user (here we assume that v = 4 symbols for this example). Now, the main question is how the symbols are chosen? The main idea in SAX is to assume that values follow a normal distribution and to choose the symbol to represent various intervals of values such that each interval is equally probable under the normal distribution (for more details, see the paper of Lin et al. 2007). Applying the SAX algorithm implementation of SPMF on the time series produces the following result.

Four symbols are created:
a = [-Infinity, 4.50]
b = [4.50, 6.2]
c = [6.2, 7.90]
d = [7.90,Infinity]

Then the result is a sequence of symbols:    a, a, c, d, d, c, c, b.  Let’s see this more visually in the following figure.

SAX representation of the time series

SAX representation of the time series

This sequence is the symbolic representation of the time series. Then, after a time series has been converted to a sequence, it is possible to apply traditional pattern mining algorithms on the time series.

For example, in the SPMF library, several algorithms are providing for performing data mining on sequences of symbols.  However, most of them requires to have several sequences of symbols to for example find patterns common to several sequences (sequential pattern mining). If one has a single time series, this is not a problem. It is also possible to split a time series into several sequences.

Splitting a time series

In SPMF, there is currently two ways of splitting a time series: (1) splitting into segments having a given number of data points, or (2) splitting a time series into a given number of segments. For example, if we split the above time series into two segments, we would obtain:

Splitting a time series

Splitting a time series

Then, these two time series could be converted to sequence of symbols using the SAX algorithm to apply traditional pattern mining algorithms such as sequential pattern mining and sequential rule mining algorithms, to find patterns that are common to these sequences. Of course, this is just a toy example, so there is no interesting patterns to be found in these time series. 😉

Removing noise in a time series using the moving average

There exists various ways of removing noise in time series. One of the simplest is to calculate the moving average of the time series. To do this, the user must specify a window size z (a number of data points). Then, each data point of the time series is replaced by the average of the previous z points. For example, if we use again the same time series, and set z to 3 points, the result is:

Moving average of a time series

Moving average of a time series

 As it can be seen in this example, applying the moving average makes the time series to become more “smooth”.

Other possibilities: time series clustering and vizualization

This blog post has given an overview of the main operations for time series mining in SPMF. More details can be found in the documentation of SPMF in the SPMF website.

Note that it is also possible to apply clustering algorithms such as K-Means, also provided in SPMF, to automatically group similar time series (time series clustering).

Besides, a time series viewer is also integrated in SPMF for visualizing time series and performing basic operation such as zooming in, zooming out, and printing.

SPMF time series viewer

SPMF time series viewer

This time series viewer will also be improved in future versions of SPMF, with additional features for manipulating time series. Other features will also be added for time series mining. Any code submission related to time series is also welcome.

For those familiar with SPMF, all the time series operations are grouped inside a new category called “time series mining” in the user interface of SPMF:

Time series operations in SPMF 2.06

Time series operations in SPMF 2.06


In this blog post, we have introduced time series data mining with the SPMF Java open source data mining library.  This gives a brief overview of possibilities for time series mining. Of course, there exist numerous other algorithms for time series mining. But this already provides many possibilities.

 If you like this blog, you can subscribe to the RSS Feed or my Twitter account ( to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!


Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation  of time series. Data Mining and Knowledge Discovery 15, 107–144 (2007)

Related posts:

Posted in Big data, Data Mining, Open-source, Time series | Tagged , , , , , , , , | 2 Comments

Plagiarism at Ilahia College of Engineering and Technology by Nasreen Ali A and Arunkumar M

I have found that two professors from the Ilahia College of Engineering and Technology  named Nasreen Ali A ( )  and Arunkumar M have plagiarized one of my paper.

The plagiarized paper is the following:

asreen Ali A.1 , Arunkumar M. Mining High Utility Itemsets from its Concise and Lossless Representations. IOSR Journals (IOSR Journal of Computer Engineering), 1(17), pp.8-14. (the full paper can be downloaded at: )

Arunkumar M and Nasreen Ali A Ilahia College of Engineering

Plagiarized paper by Arunkumar M and Nasreen Ali A at Ilahia College of Engineering

The paper plagiarized my paper describing the FHM algorithm, which was published at ISMIS 2014.  Basically, these two authors have copied the algorithm and claim that it is their algorithm. They even claim that they have proposed the new EUCP strategy which was proposed in my ISMIS 2014 paper.

This is unacceptable. I have contacted the editor of that paper from this journal named IOSR Journal of Computer Engineering. Moreover, I also contacted the head of this college called the  Ilahia College of Engineering and Technology/ Mahatma Gandhi University, India to report this serious case of plagiarism.

About Nasreen Ali A and Arunkumar M

Who are these two professors named Nasreen Ali A and Arunkumar M who have plagiarized my paper?  I have done a quick search, and according to the website of their university, here are their pictures:

Nasreen Ali A and Arunkumar M from Ilahia College of Engineering

Nasreen Ali A and Arunkumar M from Ilahia College of Engineering

Nasreen Ali has no shame to put her plagiarized paper in her CV on the Ilahia College website:

Ilahia College of Engineering - Plagiarism Nasreen Ali

Ilahia College of Engineering – Plagiarism Nasreen Ali (

Here is another picture of Nasreen Ali from the Ilahia College of Engineering and Technology website ( ).

Another picture of Nasreen Ali from Ilahia College of Engineering

Nasreen Ali from Ilahia College of Engineering

According to the college website, it can be found that the e-mail addresses of Nasreen Ali and Arunkumar M are : and ( ).

The webpage of the Ilahia College of Engineering and Technology is :

Ilahia College of Engineering and Technology

Ilahia College of Engineering and Technology

About the journal “IOSR Journal of Computer Engineering” 

The journal where the plagiarized paper was pulished is called ( IOSR Journal of Computer Engineering ). It appears to be some unknown journal that charge a fee to get published and obtain a certificate of publication.  I had never heard of that journal or publisher before.

IOSR Journal of Computer Engineering (IOSR-JCE)

IOSR Journal of Computer Engineering (IOSR-JCE)

2nd December 2016 

On the 2nd December, I have sent the following e-mail to the Ilahia College of Engineering of Engineering and the journal, to let them know about this serious incident:

Plagiarism at the Ilahia College of Engineering 2

I immediately received an answer from the journal the day after and they have retracted the paper.  But I did not receive any answers from the department where these professors work.

28th December 2016

Four weeks after having contacted the  Ilahia College of Engineering and the deparment, I sill did not receive any answer from them.

Thus, I have sent another e-mail to the principal of the Ilahia College of Engineering:
Plagiarism at the Ilahia College of Engineering

2nd February 2017

Two months later, I still did not receive any answer from the Ilahia College of Engineering or the department where these professors are working. It seems that they just want to ignore the serious plagiarism issues happening at their institution, which raises serious questions about the quality of that college.

14th April 2017

Four months later, I sent yet another e-mail to the head of the department Rosna P Haroon and the principal of the Ilahia College of Engineering.  This person Rosna P Haroon is responsible for the department where the two professors who have plagiarized my work are working. Here is the e-mail below:


15th April 2017

Today, about four months later, I received some acknowledgement that my e-mail has been received from the head of that department Rosna P Haroon, but no clear answer yet was provided about how they will handle this case of plagiarism.  They told me to wait for a few days for more information… Let’s see.

If no actions is taken, I will have to consider other actions such as contacting the institutions or organizations such as AICTE ( who is the approving authority of engineering institutions in India or affiliated institutions such as the MG University ( As researcher, I cannot accept that other people are plagiarizing my work.

By the way, although the journal paper has been retracted by the journal, there is still some online versions available on various websites, that should also be taken down to resolve this issue.

17th April 2017

I received some kind of “apology” from Nasreen Ali.  She claims that she did not cite my paper because she thought it was not published, and she claimed that she never said that she proposed my algorithm. But these claims are ridiculous.  In her paper, she write clearly that she is proposing the technique from my paper. Besides, she stole a large part of the text of my paper. She also stole the name and some pseudocode, without even citing my paper.

Then, a few hours later, I received an e-mail from Rosna P Haroon, the head of department to tell me that Nasreen Ali had sent me an e-mail. I told Rosna P Haroon that I cannot accept this kind of fake apology where they are still denying that it was plagiarism. I told R. P. Haroon that there should be some punishment since plagiarism is unacceptable, and that I want to know what will happen.

10th May

Then,  the head of the department Rosna P Haroon did not answer me about any form of punishments that there would be.  Besides, six months have passed already since the time that I have reported the issue of plagiarism and these two professors are still working at that college as if nothing happened! 

About plagiarism in India

Unfortunately, it is not the first time that some researchers from India have plagiarized my papers (for example, a similar problem of plagiarism at the Galgotias college). I think that it happened three times already only in 2016, and at least five or six times in the last two or three years.  Plagiarism is of course not limited to India. My papers have been plagiarized also by people from France, Algeria, and other countries. And it should be said that there are many very good Indian researchers and universities. From my experience, it seems that the plagiarism issue in India is mostly in small universities.


I will keep you updated about what happens about this case of plagiarism. I am looking forward to see what the Ilahia College of Engineering will do, for these two professors who have committed plagiarism.

Related posts:

Posted in Academia, Plagiarism | Tagged , , | 12 Comments

Postdoctoral positions in data mining in Shenzhen, China (apply now)

The CIID research center of the Harbin Institute of Technology (Shenzhen campus, China) is looking to hire two postdoctoral researchers to carry research on data mining / big data.

Harbin Institute of Technology (Shenzhen)

The applicant must have:

  • a Ph.D. in computer Science,
  • a strong research background in data mining/big data or artificial intelligence,
  • demonstrated the ability to publish papers in excellent conferences and/or journals in the field of data mining or artificial intelligence,
  • have an interest in the development of data mining algorithms and its applications,
  • can come from any country (but if the applicant is Chinese, s/he should hold a Ph.D. from a 211 or 985 university, or from a university abroad).

The successful applicant will:

  • work on a data mining project that could be related to sequences, time series and spatial data,  or some other topics related to data mining with both a theoretical part and an applied part related to industrial design (the exact topic will be open for discussion to take advantage of the applicant’s strengths),
  • join an excellent research team, led by Prof. Philippe Fournier-Viger, the founder of the popular SPMF data mining library, and have the opportunity to collaborate with researchers from other fields,
  • will have the opportunity to work in a laboratory equipped with state of the art equipment (e.g. brand new powerful workstations, a cluster to carry big data research, virtual reality equipment, body sensors, and much more).
  • will be hired for 2  years, at a salary of 171,600 RMB  / year  ( 51,600 RMB from the university + 120,000 RMB from the city of Shenzhen) .  Note that there the post-doctoral researcher will pay no tax on the salary, and that an apartment can be rent at a very low price through the university (around 1500 RMB / month, which saves a lot of money).
  • work in one of the top 50 universities in the field of computer science in the world, and one of the top 10 universities in China.
  • work in Shenzhen, one of the fastest-growing city in the south of China, with low pollution, warm weather all year, and close to Hong Kong.

If you are interested by this position, please apply as soon as possible by sending your detailed CV (including a list of publications and references), and a cover letter to Prof. Philippe Fournier-Viger:  It is possible to apply for year 2017-2018.


Related posts:

Posted in artificial intelligence, Big data, Data Mining, Research | 1 Comment

The KDDCup 2015 dataset

The KDD cup 2015 dataset is about MOOC dropout prediction. I have  had recently found that the dataset had been offline on the official website. Thus, I have uploaded a copy of the KDD cup 2015 dataset on my website. You can download it below.
kddcup 2015

I don’t have other information on this dataset besides what is provided above. If you want to share ideas with others or ask questions you can use the comment section below.

Hope that the dataset is useful! 🙂

If you like this blog, you can subscribe to the RSS Feed or my Twitter account ( to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

Related posts:

Posted in Big data, Data Mining, Data science, Research | Leave a comment

What not to do when applying for a M.Sc. or Ph.D position?

This brief blog discusses what not to do when applying for a M.Sc. or Ph.D. position in a research lab. The aim of this post is to give advices to those applying for such positions.


I had previously discussed about this topic in another blog post, where I explained that it is important to send personalized e-mails to potential supervisors rather than sending the same e-mail to many professors. I will thus not explain that again.

In this post, I will rather emphasizes another important aspect, which is to give a good impression of yourself to other people. I discuss this by using an e-mail that I have received today:

From *****@***.sa
Subject: Apply For Scholars ship Ph.d

Sir Philippe Fournier

A person sending this type of e-mails has zero chances of getting a position in my team. Why?

  • It is poorly written. There are many typos and English errors. If this person cannot take the time to write an e-mail properly, then it gives the impression that this person is careless, and would do a bad job.
  • The person did not take the time to spell my full name correctly. Not spelling the name properly shows a lack of respect or shows carelessness. This is something to absolutely avoid.
  • The person asks  to work on web security. I have never published a single paper on that topic. Thus, I would not hire that person. This person should contact a professor working on such topics.
  • The applicant does not provide his CV and very few information about himself. He mentions that he has two publications. But it does not mean anything if I don’t know where they have been published. There are so many bad conferences and journals. An applicant should always send his CV, to avoid sending e-mails back and forth to obtain the information. I will often not answer the e-mails if there are no CV attached to the e-mail, just because it wastes time to send e-mails to ask for the CV, when it should be provided. Besides, when a CV is provided, it should be detailed enough. It is even better when a student can provides transcripts showing his grades in previous studies.
  • The applicant does not really explain why he wants to work with me or how he has found my profile. For a master degree, this is not so important. But when applying for a Ph.D., it is expected that the applicant will chose his supervisor for a reason such as common research interests (as I have discussed above).

That is all I wanted to write for today.

If you like this blog, you can subscribe to the RSS Feed or my Twitter account ( to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

Related posts:

Posted in General, Research | Leave a comment