An Introduction to Sequential Rule Mining

In this blog post, I will discuss an interesting topic in data mining: sequential rule mining, which consists of discovering rules in sequences. This data mining task has many applications, for example for analyzing the behavior of customers in supermarkets or of users on a website.

Introduction

Before discussing this topic, let me talk a little bit about the context. There has been a lot of work in the field of data mining about pattern mining. The goal of pattern mining is to discover useful, novel and/or unexpected patterns in databases. In this blog post, we will be interested in a specific type of database called a sequence database. A sequence database contains some sequences. For example, consider the following database:

A sequence database containing four sequences

This database contains four sequences named seq1, seq2, seq3 and seq4. For our example, consider that the symbols “a”, “b”, “c”, “d”, “e”, “f”, “g” and “h” respectively represent some items sold in a supermarket. For example, “a” could represent an “apple”, “b” could be some “bread”, “c” could denote “cake”, etc.

Now, a sequence is an ordered list of sets of items. For our example, we will assume that each sequence represents what a customer has bought in our supermarket over time. For example, consider the second sequence “seq2”. This sequence indicates that the second customer bought items “a” and “d” together, then bought item “c”, then bought “b”, and then bought “a”, “b”, “e” and “f” together.

Sequences are a very common type of data structure that can be found in many domains such as bioinformatics (DNA sequences), sequences of clicks on websites, the behavior of learners in e-learning, sequences of what customers buy in retail stores, sentences of words in a text, etc. It should be noted that sequences can be ordered by time or by other properties (e.g. the order of nucleotides in a DNA sequence).

Discovering sequential patterns in sequences

An important data mining problem is to design algorithms for discovering hidden patterns in sequences. There has been a lot of research on this topic in the field of data mining and various algorithms have been proposed.

In the following, I will discuss two types of patterns that can be found. I will first discuss sequential patterns. Then, I will explain some of their limitations and then discuss sequential rules.

A sequential pattern is a subsequence that appears in several sequences of a database. For example, the sequential pattern <{a}{c}{e}> appears in the first two sequences of our database. This pattern is quite interesting. It indicates that customers who bought {a} often bought {c} afterward, followed by {e}. Such a pattern is said to have a support of two sequences because it appears in two sequences of the database. Several algorithms have been proposed for finding all sequential patterns in a database, such as CM-SPADE, PrefixSpan and GSP. These algorithms take as input a sequence database and a minimum support threshold (minsup). Then, they output all sequential patterns having a support no less than minsup. Those patterns are said to be the frequent sequential patterns.
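To make the notions of containment and support more concrete, here is a minimal Java sketch written for this post; it is not taken from any mining library and is not how the above algorithms work internally. It represents a sequence as a list of itemsets and counts in how many sequences a pattern appears. Only seq2 is reproduced from the example database; the other sequences are omitted.

import java.util.*;

public class SupportExample {

    // A sequence is an ordered list of itemsets; an itemset is a set of items.
    // A pattern is contained in a sequence if its itemsets can be matched, in order,
    // to itemsets of the sequence (greedy matching is sufficient for this check).
    static boolean contains(List<Set<String>> sequence, List<Set<String>> pattern) {
        int pos = 0; // index of the next itemset of the pattern to match
        for (Set<String> itemset : sequence) {
            if (pos < pattern.size() && itemset.containsAll(pattern.get(pos))) {
                pos++; // matched this itemset of the pattern, move to the next one
            }
        }
        return pos == pattern.size();
    }

    // The support of a pattern is the number of sequences that contain it
    static int support(List<List<Set<String>>> database, List<Set<String>> pattern) {
        int count = 0;
        for (List<Set<String>> sequence : database) {
            if (contains(sequence, pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // seq2 from the example: {a,d}, {c}, {b}, {a,b,e,f}
        List<Set<String>> seq2 = List.of(Set.of("a", "d"), Set.of("c"),
                Set.of("b"), Set.of("a", "b", "e", "f"));
        List<List<Set<String>>> database = List.of(seq2 /* ... plus the other sequences */);

        List<Set<String>> pattern = List.of(Set.of("a"), Set.of("c"), Set.of("e"));
        // Prints 1 for this partial database; it would print 2 for the full example database
        System.out.println("support = " + support(database, pattern));
    }
}

A real algorithm such as PrefixSpan avoids this kind of naive database scan by growing patterns from projected databases, but the definition of support that it computes is the same.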

For the above database, if we run CM-SPADE with minsup = 3, we will find the following frequent sequential patterns:

<{a}> with a support of 3 sequences
<{a},{e}> with a support of 3 sequences
<{a},{f}> with a support of 3 sequences
<{b},{e}> with a support of 3 sequences
<{b},{f}> with a support of 4 sequences

Sequential patterns can be quite interesting. In the example, we can learn that buying item “b” is followed by buying item “e” in 3 sequences. However, sequential patterns can be misleading. An important limitation of sequential patterns is that there is no assessment of the probability that a pattern will be followed. Let me explain this in more detail. Consider again the pattern <{b},{e}>. This pattern is said to appear in 3 sequences. It may thus seem likely that if someone buys “b”, he will also buy “e” afterward. But how likely? We can observe that item “b” appears in four sequences. Thus, the probability that “e” appears after “b” is actually 3 / 4 = 75 % (i.e. P(e|b) = 75%). But sequential patterns only indicate how often the pattern appears. They do not provide any indication about this probability.

Discovering sequential rules in sequences

This now leads us to the main topic of this post, which is sequential rule mining. Sequential rule mining has been proposed as an alternative to sequential pattern mining to take into account the probability that a pattern will be followed. I will provide a few definitions and then we will look at a full example.

A sequential rule is a rule of the form X -> Y, where X and Y are sets of items (itemsets). A rule X -> Y is interpreted as: if the items in X occur (in any order), then they will be followed by the items in Y (in any order). For example, consider the rule {a} -> {e,f}. It means that if a customer buys item “a”, then the customer will later buy the items “e” and “f”. But the order among the items in {e,f} is not important. This means that a customer may buy “e” before “f” or “f” before “e”.

To find sequential rules, two measures are generally used: the support and the confidence. The support of a rule X -> Y is the number of sequences that contain the items from X followed by the items from Y. For example, the support of the rule {a} -> {e,f} is 3 sequences because {a} appears before the items from {e,f} in three sequences (seq1, seq2 and seq3).

The confidence of a rule X -> Y is the support of the rule divided by the number of sequences containing the items from X. It can be understood as the conditional probability P(Y|X). For example, the confidence of the rule {a} -> {e,f} is 1 (or 100 % if written as a percentage), because every time a customer buys item “a”, he then buys “e” and “f” in the example database. Another example is the rule {a} -> {b}. This rule has a support of 2 sequences and a confidence of 0.66 (that is, 66%).
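To make these definitions concrete, here is a minimal Java sketch that computes the support and confidence of a rule by naively scanning the database. This is only an illustration of the definitions, not how RuleGrowth or ERMiner proceed internally. Again, only seq2 of the example database is reproduced.

import java.util.*;

public class RuleMeasures {

    // Returns true if the rule X -> Y appears in the sequence: all items of X occur
    // (in any order) before all items of Y (in any order)
    static boolean appearsIn(List<Set<String>> sequence, Set<String> x, Set<String> y) {
        // find the first position k such that X is fully contained in itemsets 0..k
        Set<String> seenX = new HashSet<>();
        int k = -1;
        for (int i = 0; i < sequence.size(); i++) {
            for (String item : sequence.get(i)) {
                if (x.contains(item)) {
                    seenX.add(item);
                }
            }
            if (seenX.size() == x.size()) {
                k = i;
                break;
            }
        }
        if (k == -1) {
            return false; // X does not appear in this sequence
        }
        // check that Y is contained in the itemsets strictly after position k
        Set<String> itemsAfter = new HashSet<>();
        for (int i = k + 1; i < sequence.size(); i++) {
            itemsAfter.addAll(sequence.get(i));
        }
        return itemsAfter.containsAll(y);
    }

    // Returns true if all items of X appear somewhere in the sequence
    static boolean containsItems(List<Set<String>> sequence, Set<String> x) {
        Set<String> allItems = new HashSet<>();
        for (Set<String> itemset : sequence) {
            allItems.addAll(itemset);
        }
        return allItems.containsAll(x);
    }

    public static void main(String[] args) {
        // seq2 from the example: {a,d}, {c}, {b}, {a,b,e,f}
        List<Set<String>> seq2 = List.of(Set.of("a", "d"), Set.of("c"),
                Set.of("b"), Set.of("a", "b", "e", "f"));
        List<List<Set<String>>> database = List.of(seq2 /* ... plus the other sequences */);

        Set<String> x = Set.of("a");
        Set<String> y = Set.of("e", "f");
        int ruleSupport = 0;
        int antecedentSupport = 0;
        for (List<Set<String>> sequence : database) {
            if (appearsIn(sequence, x, y)) {
                ruleSupport++;
            }
            if (containsItems(sequence, x)) {
                antecedentSupport++;
            }
        }
        // confidence = support of the rule / number of sequences containing X
        double confidence = (double) ruleSupport / antecedentSupport;
        System.out.println("support = " + ruleSupport + ", confidence = " + confidence);
    }
}

On the full example database, this would give a support of 3 sequences and a confidence of 1 for the rule {a} -> {e,f}, as stated above.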

A sequential rule mining algorithm such as RuleGrowth, ERMiner or CMRules will output all sequential rules having a support and a confidence respectively no less than some thresholds minsup and minconf set by the user. For example, consider again the example database and suppose that the user sets minsup = 0.5 and minconf = 60%. The following rules are found by RuleGrowth:

{a,b,c} -> {e} support = 2 sequences confidence = 100 %
{a} -> {c,e,f} support = 2 sequences confidence = 66%
{a,b} -> {e,f} support = 3 sequences confidence = 100%
{b} -> {e,f} support = 3 sequences confidence = 75 %
{a} -> {e,f} support = 3 sequences confidence = 100%
{c} -> {e,f} support = 2 sequences confidence = 100%
{a} -> {b} support = 2 sequences confidence = 66%

These rules can be viewed as more interesting than sequential patterns since they give a measure of confidence that they will be followed.  For example, it is very informative to know that some rules such as {c} -> {e,f} have a confidence of 100 %.

In the past, I carried out a study with my student to compare the prediction accuracy of sequential patterns and sequential rules. In that study, we found that sequential rules can provide a much higher prediction accuracy than sequential patterns when the patterns are used for sequence prediction. The reason is that sequential rules consider the probability (confidence), while sequential patterns do not.

Extensions of the task of sequential rule mining

In the previous paragraphs, I have introduced the topic of sequential rule mining. But note that there also exist several extensions of the problem of sequential rule mining. These extensions have been proposed to address specific needs. I will provide a brief overview of a few of them.

  • Discovering the top-k sequential rules. The idea is to discover the k most frequent rules in a dataset having a confidence no less than minconf. For example, a user may specify that he wants to find the top 1000 rules having a confidence of at least 75 %. Some algorithms for this task are TopSeqRules and TNS.
  • Discovering sequential rules with a window size constraint. This extension lets the user find rules of the form X -> Y where X and Y must be close to each other with respect to time. For example, a user may want to find rules appearing within three consecutive itemsets in sequences. This is interesting, for example, for analyzing sequences of web clicks. An algorithm for this task is TRuleGrowth.
  • Discovering high-utility sequential rules. Another extension is to discover rules where items may be annotated with quantities in sequences and each item may have a unit profit. For example, we may have a sequence where a customer bought three breads, then two apples and two bottles of milk, and these items may have unit profits of 1$, 2$ and 1.50$, respectively (so this sequence would generate 3 x 1$ + 2 x 2$ + 2 x 1.50$ = 10$ of profit). The goal of high-utility sequential rule mining is to find rules that generate a high profit and have a high confidence (high-utility rules). An algorithm for this task is HUSRM.

Open-source implementations and datasets

Several algorithms have been proposed for sequential rule mining and sequential pattern mining. Java implementations of the state-of-the-art algorithms are currently offered in my open-source data mining library named SPMF.

It offers several state-of-the-art algorithms for sequential rule mining such as ERMiner (2014),  TNS (2013), RuleGrowth (2011), TopSeqRules (2011), and CMRules (2010). Besides, SPMF offers several algorithms for sequential pattern mining such as CM-SPADE (2014),  VMSP (2014), LAPIN (2005) and PrefixSpan (2004).  To our knowledge, ERMiner is the fastest sequential rule mining algorithm. But RuleGrowth is still quite fast and consumes less memory. You can try the above algorithms by going to the SPMF website. On the website, you will find instructions about how to run algorithms and some datasets on the dataset page.

SPMF also offers hundreds of algorithms to find other types of patterns such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.

Applications of sequential rule mining

Some examples of applications of sequential rule mining are e-learning, manufacturing simulation, quality control, web page prefetching, anti-pattern detection in service-based systems, embedded systems, alarm sequence analysis, and restaurant recommendation. For example, here are a few papers describing such applications:

E-learning

Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E.: CMRules: Mining Sequential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, 25(1): 63-76 (2012)

Toussaint, Ben-Manson, and Vanda Luengo. “Mining surgery phase-related sequential rules from vertebroplasty simulations traces.” Artificial Intelligence in Medicine. Springer International Publishing, 2015. 35-46.

Faghihi, Usef, Philippe Fournier-Viger, and Roger Nkambou. “CELTS: A Cognitive Tutoring Agent with Human-Like Learning Capabilities and Emotions.” Intelligent and Adaptive Educational-Learning Systems. Springer Berlin Heidelberg, 2013. 339-365.

Manufacturing simulation

Bogon, T., Timm, I. J., Lattner, A. D., Paraskevopoulos, D., Jessen, U., Schmitz, M., Wenzel, S., Spieckermann, S.: Towards Assisted Input and Output Data Analysis in Manufacturing Simulation: The EDASIM Approach. In: Proc. 2012 Winter Simulation Conference, pp. 257–269 (2012)

Quality control

Kamsu-Foguem, B., Rigal, F., Mauget, F.: Mining association rules for the quality improvement of the production process. Expert Systems with Applications 40(4), 1034-1045 (2012)

Web page prefetching

Fournier-Viger, P., Gueniche, T., Tseng, V.S.: Using Partially-Ordered Sequential Rules to Generate More Accurate Sequence Prediction. Proc. 8th International Conference on Advanced Data Mining and Applications, pp. 431-442, Springer (2012)

Anti-pattern detection in service-based systems

Nayrolles, M., Moha, N., Valtchev, P.: Improving SOA antipatterns detection in Service Based Systems by mining execution traces. In: Proc. 20th IEEE Working Conference on Reverse Engineering, pp. 321-330 (2013)

Embedded systems

Leneve, O., Berges, M., Noh, H. Y.: Exploring Sequential and Association Rule Mining for Pattern-based Energy Demand Characterization. In: Proc. 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings, pp. 1–2, ACM (2013)

Alarm sequence analysis

Celebi, O.F., Zeydan, E., Ari, I., Ileri, O., Ergut, S.: Alarm Sequence Rule Mining Extended With A Time Confidence Parameter. In: Proc. 14th Industrial Conference on Data Mining (2014)

Recommendation

Jannach, Dietmar, and Simon Fischer. “Recommendation-based modeling support for data mining processes.” Proceedings of the 8th ACM Conference on Recommender systems. ACM, 2014.

Interestingly, the above work reported that the sequential rules found by CMRules provided better results than the patterns found using FPGrowth and other compared algorithms.

Jannach, D., Jugovac, M., & Lerche, L. (2015, March). Adaptive Recommendation-based Modeling Support for Data Analysis Workflows. In Proceedings of the 20th International Conference on Intelligent User Interfaces (pp. 252-262). ACM.

Restaurant recommendation

Han, M., Wang, Z., Yuan, J.: Mining Constraint Based Sequential Patterns and Rules on Restaurant Recommendation System. Journal of Computational Information Systems 9(10), 3901-3908 (2013)

Customer behavior analysis

Noughabi, Elham Akhond Zadeh, Amir Albadvi, and Behrouz Homayoun Far. “How Can We Explore Patterns of Customer Segments’ Structural Changes? A Sequential Rule Mining Approach.” Information Reuse and Integration (IRI), 2015 IEEE International Conference on. IEEE, 2015.

What is the difference between sequential rules, association rules and episode rules?

A question that readers familiar with data mining may have is: what is the difference between sequential rules and association rules? The answer is as follows. Sequential rules are found in sequences, while association rules are found in records (transactions) containing items that are not ordered. In other words, the order between items (such as time) is not considered in association rule mining. Some sequential rule mining algorithms such as CMRules will first discover association rules, and then filter out some rules using the sequential ordering to keep only the sequential rules.

It should be noted that there are also some other terms used in research papers. For example, a few papers talk about “temporal association rules”. These can be viewed as a kind of sequential rules. However, between “temporal association rules” and sequential rules, I prefer the term sequential rules. The reason is that a sequence is not always ordered by time. For example, in a sequence of words, there is an order but it is not based on time. Thus, I think that sequential rules is a better name because it means that there is some order, but this order could be based on time or something else, while “temporal” refers specifically to time.

There are also some other names that are used… Generally, if we find the rules in many sequences, we call them sequential rules. But if we find the rules in only one sequence, some people call these rules “episode rules”. In SPMF, we recently added some code for discovering episode rules in a single long sequence, such as the POERM algorithm. Episode rules can be viewed as a special type of sequential rules for a single sequence.

Conclusion

In this blog post, I have given an overview of the tasks of sequential rule mining and sequential pattern mining, which aim at discovering patterns in sequences. Hope that you enjoyed reading it 😉 For researchers, there are many possibilities for research on this topic.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 80 data mining algorithms.


How to test if a data mining algorithm implementation is correct?

In this blog post, I will discuss how to check if a data mining algorithm implementation is correct and complete. This is a very important topic for researchers who are implementing data mining algorithms, since an incorrect implementation may generate unexpected results. It may also be important for those who want to compare two implementations downloaded from the internet, for example. There are a few different ways to check if a data mining algorithm implementation works as expected.

1) Is it theoretically correct?

The first step is to make sure that the algorithm itself does not contain any errors. This can be done before implementing the algorithm. It sometimes happens that the description of an algorithm in a research paper contains errors. These errors may be typos or more fundamental errors. It is thus important to read the paper carefully and understand it well before implementing the data mining algorithm, to detect these errors, if there are any.

2) Testing the algorithm on a small dataset – debugging by hand

After the algorithm has been implemented, it is time to test it. First, I usually try to run the algorithm on some small dataset. For example, to test a sequential pattern mining algorithm, I may use a dataset that contains 5 sequences instead of a dataset containing 100,000 sequences. Using a small dataset is much easier for debugging. I will run the algorithm on the small dataset and check by hand if the results are correctly calculated. Or, if there is an example provided in the paper describing the algorithm, I will try to use the same example. I will also vary the parameters to see if the behavior remains correct on the small dataset. If there are errors, I will use the debugger to find the problem and fix it. I may also use the debugger to check step by step how the algorithm works and see if there is some weird behavior. During this step, I may also use some paper and a pencil to work out an example by hand and check if the results generated by the algorithm are the same.

Drawing an example on a piece of paper to debug an algorithm implementation

3) Testing the algorithm on large datasets – comparison with another implementation

Second, I will run the algorithm on larger datasets. The reason is that some errors will only occur on large datasets. For example, I have recently encountered an integer overflow error that was only occurring when the dataset was very large. To test a sequential pattern mining algorithm, I may use a dataset containing 1 million sequences.

If I have some other algorithm implementations for the same problem, I will use them to see if the new implementation generates the same results. A simple way to do that is to take the output files of two implementations and use file-comparison software such as UltraCompare (not free). This kind of software quickly highlights the differences between two output files. Then, if there are some differences, I will analyse the results further to see which implementation is incorrect.

Comparing the output of two algorithm implementations using UltraCompare

4) Testing the algorithm on large datasets – automatic tests

Another way to check if an implementation is correct is to write some automatic tests. The reason is that on large datasets, it is likely impossible to check by hand if all the results are correct. For example, if I run a sequential pattern mining algorithm on a large dataset, it would be too time-consuming to check by hand that the support is calculated correctly for all patterns found by the algorithm. To address this issue, I will write some code that takes each pattern and scans the database to recalculate its support using a brute-force approach. This allows checking automatically that the results are correct. And if I find some errors, I will use the debugger to find where they come from, and fix them.
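Here is a minimal Java sketch of such a brute-force check for sequential patterns. The parsing of the algorithm's output file is left out; it is assumed that the found patterns and their reported supports have already been loaded into a map, and the toy database in the main method is only there to make the sketch runnable.

import java.util.List;
import java.util.Map;
import java.util.Set;

public class BruteForceSupportCheck {

    // Check whether a sequential pattern occurs in a sequence
    // (greedy matching of the pattern's itemsets, in order, against the sequence)
    static boolean contains(List<Set<String>> sequence, List<Set<String>> pattern) {
        int pos = 0;
        for (Set<String> itemset : sequence) {
            if (pos < pattern.size() && itemset.containsAll(pattern.get(pos))) {
                pos++;
            }
        }
        return pos == pattern.size();
    }

    // Recompute the support of each found pattern by scanning the whole database,
    // and report any pattern whose reported support does not match
    static void verify(List<List<Set<String>>> database,
                       Map<List<Set<String>>, Integer> foundPatterns) {
        for (Map.Entry<List<Set<String>>, Integer> entry : foundPatterns.entrySet()) {
            int recomputed = 0;
            for (List<Set<String>> sequence : database) {
                if (contains(sequence, entry.getKey())) {
                    recomputed++;
                }
            }
            if (recomputed != entry.getValue()) {
                System.out.println("Mismatch for " + entry.getKey() + ": reported "
                        + entry.getValue() + ", recomputed " + recomputed);
            }
        }
    }

    public static void main(String[] args) {
        // A tiny database with a single sequence: {a}, {b,c}
        List<List<Set<String>>> database = List.of(
                List.of(Set.of("a"), Set.of("b", "c")));
        // Pattern <{a},{c}> with a deliberately wrong reported support of 2
        Map<List<Set<String>>, Integer> found = Map.of(
                List.of(Set.of("a"), Set.of("c")), 2);
        verify(database, found); // prints a mismatch: the recomputed support is 1
    }
}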

Conclusion

In this blog post, I have discussed how to check if the implementation of a data mining algorithm is correct. It can be quite time-consuming to check if an implementation is correct. Sometimes, I may spend one or two days on debugging. But it is still very important to verify that an algorithm implementation is correct, especially when proposing a new algorithm. If an implementation is incorrect, it may completely change the results when comparing the algorithm with other algorithms.

==

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 80 data mining algorithms.

If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.


The PAKDD 2015 Conference (a brief report)

I attended the 19th PAKDD 2015 conference in Ho Chi Minh City, Vietnam from the 19th to the 22nd of May. In this blog post, I give some brief comments about the PAKDD 2015 conference.

PAKDD (the Pacific-Asia Conference series on Knowledge Discovery and Data Mining) is a top data mining conference, held every year in Asia or the Pacific area. The proceedings are published by Springer in the Lecture Notes in Artificial Intelligence series.

The PAKDD 2015 conference was held at the Rex hotel which is centrally located in Ho Chi Minh City.

This year, 405 papers were submitted and 117 papers were accepted, for an acceptance rate of 28.9 %.

The first day had workshops and tutorials, while the following days were paper presentations.

In terms of social activities at PAKDD, there was an excursion to visit some old tunnels used during the Vietnam War, and I also visited the Mekong Delta. I also met some very nice Vietnamese students who used my SPMF software in their research and treated me to a restaurant.

At PAKDD, I presented the paper of my student about a new model called CPT+ for sequence prediction.

Overall, it was a very interesting conference. I feel that Vietnam is a very nice place. I did not write too much on the blog, as I have been quite busy enjoying the conference 😉

If you are interested, you can also read my reports about the PAKDD 2014, PAKDD 2017, PAKDD 2018, PAKDD 2019, PAKDD 2020 and PAKDD 2024 conferences.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which offers more than 150 data mining algorithms.


An Introduction to High-Utility Itemset Mining

In this blog post, I will give an introduction to a popular problem in data mining, which is called “high-utility itemset mining” or, more generally, utility mining. I will give an overview of this problem, explain why it is interesting, and provide the source code of Java implementations of the state-of-the-art algorithms for this problem, as well as datasets.

Frequent itemset mining

The problem of high-utility itemset mining is an extension of the problem of frequent pattern mining. Frequent pattern mining is a popular problem in data mining, which consists in finding frequent patterns in transaction databases. Let me first describe the problem of frequent itemset mining.

Consider the following database. It is a transaction database. A transaction database is a database containing a set of transactions made by customers. A transaction is a set of items bought by a customer. For example, in the following database, the first customer bought items “a”, “b”, “c”, “d” and “e”, while the second one bought items “a”, “b” and “e”.

A transaction database

The goal of frequent itemset mining is to find frequent itemsets. Many popular algorithms have been proposed for this problem, such as Apriori, FPGrowth, LCM, PrePost, FIN, Eclat, etc. These algorithms take as input a transaction database and a parameter “minsup” called the minimum support threshold. They then return all sets of items (itemsets) that appear in at least minsup transactions. For example, if we set minsup = 2, in our example, we would find several such itemsets (called frequent itemsets), such as the following:

Some frequent itemsets

For example, consider the itemset {b,d,e}. It is said to have a support of 3 because it appears in three transactions, and it is said to be frequent because the support of {b,d,e} is no less than minsup.
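As an illustration, here is a minimal Java sketch of how the support of an itemset can be counted by a naive scan of a transaction database (this is of course not how Apriori or FPGrowth proceed; they are far more efficient). Only the first two transactions come from the example above; since the full database is shown in a figure that is not reproduced here, the last two transactions are made up for the illustration.

import java.util.List;
import java.util.Set;

public class ItemsetSupport {

    // The support of an itemset is the number of transactions that contain all of its items
    static int support(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The first two transactions are from the example; the last two are hypothetical
        List<Set<String>> transactions = List.of(
                Set.of("a", "b", "c", "d", "e"),
                Set.of("a", "b", "e"),
                Set.of("b", "d", "e"),
                Set.of("b", "c", "d", "e"));

        System.out.println("support of {b,d,e} = " + support(transactions, Set.of("b", "d", "e")));
        // An itemset is frequent if its support is no less than minsup
    }
}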

Frequent itemset mining has some important limitations

The problem of frequent itemset mining is popular. But it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account. Thus, an item may only appear once or not at all in a transaction. So whether a customer has bought five breads, ten breads or twenty breads, it is viewed as the same.

A second important limitation is that all items are viewed as having the same importance, utility or weight. For example, if a customer buys a very expensive bottle of wine or just a piece of bread, it is viewed as being equally important.

Thus, frequent pattern mining may find many frequent patterns that are not interesting. For example, one may find that {bread, milk} is a frequent pattern. However, from a business perspective, this pattern may be uninteresting because it does not generate much profit. Moreover, frequent pattern mining algorithms may miss the rare patterns that generate a high profit, such as perhaps {caviar, wine}.

High-utility itemset mining

To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In this problem, a transaction database contains transactions where purchase quantities are taken into account as well as the unit profit of each item. For example, consider the following transaction database.

A transaction database with quantities and unit profit information for items

Consider transaction T3. It indicates that the corresponding customer has bought two units of item “a”, six units of item “c”, and two units of item “e”. Now look at the table on the right. This table indicates the unit profit of each item. For example, the unit profits of items “a”, “b”, “c”, “d” and “e” are respectively 5$, 2$, 1$, 2$ and 3$. This means, for example, that each unit of “a” that is sold generates a profit of 5$.

The problem of high-utility itemset mining is to find the itemsets (groups of items) that generate a high profit in a database when they are sold together. The user has to provide a value for a threshold called “minutil” (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is, those that generate at least “minutil” profit. For example, consider that “minutil” is set to 25$ by the user. The result of a high-utility itemset mining algorithm would be the following:

High-utility itemsets

For example, consider the itemset {b,d}. It is considered to be a high-utility itemset because it has a utility of 40$ (it generates a profit of 40$), which is no less than the minutil threshold that has been set to 25$ by the user. Now, let’s look in more detail at how the utility (profit) of an itemset is calculated. The utility of an itemset in a transaction is the sum, over the items of the itemset, of the item’s quantity multiplied by its unit profit. For example, consider the figure below. The utility of {a,e} in transaction T0 is 1 x 5 + 1 x 3 = 8 $. Similarly, the utility of {a,e} in transaction T3 is 2 x 5 + 2 x 3 = 16 $. The utility of an itemset in the whole database is then the sum of its utility in all transactions where it appears. Thus, the utility of {a,e} is 8$ + 16$ = 24$ because it appears only in transactions T0 and T3.

Illustration of how to calculate the utility of itemset {a,e}
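To make this calculation concrete, here is a minimal Java sketch that computes the utility of {a,e} from the quantities and unit profits given above. Only the quantities needed for this calculation are reproduced (any other items in T0 and T3 are omitted), and each transaction is simply represented as a map from item to purchased quantity.

import java.util.List;
import java.util.Map;
import java.util.Set;

public class UtilityExample {

    // Utility of an itemset in one transaction: sum of (quantity x unit profit) of its items,
    // or 0 if the transaction does not contain all the items of the itemset
    static int utilityInTransaction(Map<String, Integer> transaction,
                                    Map<String, Integer> unitProfit, Set<String> itemset) {
        if (!transaction.keySet().containsAll(itemset)) {
            return 0;
        }
        int utility = 0;
        for (String item : itemset) {
            utility += transaction.get(item) * unitProfit.get(item);
        }
        return utility;
    }

    // Utility of an itemset in the database: sum of its utility in every transaction
    static int utility(List<Map<String, Integer>> database,
                       Map<String, Integer> unitProfit, Set<String> itemset) {
        int total = 0;
        for (Map<String, Integer> transaction : database) {
            total += utilityInTransaction(transaction, unitProfit, itemset);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> unitProfit = Map.of("a", 5, "b", 2, "c", 1, "d", 2, "e", 3);
        // Only the quantities relevant to {a,e} are reproduced from the example
        Map<String, Integer> t0 = Map.of("a", 1, "e", 1);         // transaction T0 (partial)
        Map<String, Integer> t3 = Map.of("a", 2, "c", 6, "e", 2); // transaction T3
        List<Map<String, Integer>> database = List.of(t0, t3);

        // Prints 24, i.e. 8$ from T0 plus 16$ from T3
        System.out.println(utility(database, unitProfit, Set.of("a", "e")));
    }
}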

Why is the problem of high-utility itemset mining interesting?

The problem of high-utility itemset mining is quite interesting for the following reasons.

First, it may be more interesting from a practical perspective to discover itemsets that generate a high profit in customer transactions than those that are bought frequently.

Second, from a research perspective, the problem of high-utility itemset mining is more challenging. In frequent itemset mining, there is a well-known property of the frequency (support) of itemsets, which states that given an itemset, all its supersets must have a support that is lower or equal. This is often called the “Apriori property” or “anti-monotonicity” property and is very powerful for pruning the search space, because if an itemset is infrequent then we know that all its supersets are also infrequent and may be pruned. In high-utility itemset mining there is no such property. Thus, given an itemset, the utility of its supersets may be higher, lower or the same. For example, in the previous example, the utilities of itemsets {a}, {a,e} and {a,b,c} are respectively 20$, 24$ and 16$.

In this blog post, I will not go into the details of how high-utility itemset mining algorithms work. But a key idea is to use upper bounds on the utility of itemsets that restore the anti-monotonicity property, in order to prune the search space. This will be the topic of a future blog post.

Open-source implementations and datasets

Several algorithms for high-utility itemset mining have been proposed over the years. Java implementations of the state-of-the-art algorithms are currently offered in my open-source data mining library named SPMF. It offers, for example, the source code of Two-Phase (2005), UPGrowth (2011), HUI-Miner (2012) and FHM (2014). To our knowledge, FHM is one of the fastest algorithms for this problem. It was shown to be up to six times faster than HUI-Miner, which was shown to be up to 100 times faster than UPGrowth, which was shown to be up to 1000 times faster than Two-Phase. You can try FHM and the other above algorithms by going to the SPMF website. On the website, you will find instructions about how to run the algorithms and some datasets on the dataset page. Update: Recently, the EFIM algorithm was proposed and shown to outperform FHM by up to 1000 times; it is also offered in SPMF.

SPMF also offers many other algorithms to identify  other pattern types such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.

Note that there also exists several variations of the problem of high-utility itemset mining, which I will not cover in this blog post.

To know more: key papers, glossary, and video lectures

You may also be interested in reading my list of key research papers on high utility itemset mining and the glossary of high utility itemset mining, or watch my video lectures on this topic.

Conclusion

In this blog post, I tried to give an overview of the problem of high-utility itemset mining. I will try to keep posting blog posts such as this one about interesting data mining problems in the future. Hope that you enjoyed reading 😉

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.


Drawing a set-enumeration tree using Java and GraphViz

In this blog post, I will explain and provide source code to automatically  draw the set-enumeration tree of a set using Java and GraphViz.  Drawing a set-enumeration tree is useful in computer science, for example in frequent itemset mining, a subfield of data mining.

What is a set-enumeration tree?

The concept of set-enumeration tree was proposed by Rymon (1992).

Rymon, R. (1992). Search through systematic set enumeration. Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Cambridge, MA, Morgan Kaufmann.

Let X = {a,b,c,d} be a set. A set-enumeration tree for that set is the following structure:

The set-enumeration tree of {1,2,3,4}

Now, I will show how to draw a nice diagram automatically such as the one above.

Step 1.  Generate a GraphViz input file

The first step is to have Java installed on your computer and to use the following Java program. This Java program will generate a GraphViz input file named “input.dot” for the set-enumeration tree of {a,b,c,d,e}. Note that you can edit the line String[] set = new String[] { "a", "b", "c", "d", "e" } to draw the set-enumeration tree of another set. Note also that the Java code below is not fully optimized, but for the purpose of drawing set-enumeration trees, it is fine.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

/**
 * Draw the set-enumeration tree of a set using GraphViz
 * @author Philippe Fournier-Viger, 2015
 */
public class DrawSetEnumerationTree {

    public static void main(String[] arg) throws IOException, InterruptedException {

        // This is the set of items for which we want to draw the set-enumeration tree
        String[] set = new String[] { "a", "b", "c", "d", "e" };

        // output file
        String output = "input.dot";

        // create the file
        BufferedWriter writer = new BufferedWriter(new FileWriter(output));
        writer.write("digraph mygraph{" + System.lineSeparator());
        writer.write(" node [shape=plaintext]" + System.lineSeparator());

        // We will generate all links between subsets with a depth-first search
        recursive("\"{", set, 0, writer);
        // write end of file
        writer.write("}");
        writer.close();
    }

    private static void recursive(String currentPrefix, String[] set, int offset,
            BufferedWriter writer) throws IOException {
        // The vertex corresponding to the current subset (e.g. "{a,b}")
        String startVertex = currentPrefix + "}\"";

        // Append each remaining item to the current subset and write the corresponding edge
        for (int i = offset; i < set.length; i++) {
            String concatenate;
            if (offset == 0) {
                concatenate = currentPrefix + set[i];
            } else {
                concatenate = currentPrefix + "," + set[i];
            }
            String endVertex = concatenate + "}\"";
            writer.write(" " + startVertex + " -> " + endVertex);
            writer.write(System.lineSeparator());
            // Recursively extend the new subset with the items after position i
            recursive(concatenate, set, i + 1, writer);
        }
    }
}

Running the above program creates a file called input.dot containing content similar to the following (shown here for the smaller set {a,b,c,d}), which represents the nodes of the graph that will be drawn and the links between nodes.

digraph mygraph{
 node [shape=plaintext]
 "{}" -> "{a}"
 "{a}" -> "{a,b}"
 "{a,b}" -> "{a,b,c}"
 "{a,b,c}" -> "{a,b,c,d}"
 "{a,b}" -> "{a,b,d}"
 "{a}" -> "{a,c}"
 "{a,c}" -> "{a,c,d}"
 "{a}" -> "{a,d}"
 "{}" -> "{b}"
 "{b}" -> "{b,c}"
 "{b,c}" -> "{b,c,d}"
 "{b}" -> "{b,d}"
 "{}" -> "{c}"
 "{c}" -> "{c,d}"
 "{}" -> "{d}"
}

Step 2.  Generating a PNG file of the graph using GraphViz

Then, we can run GraphViz from the command line to generate the graph as a PNG file:

dot -Tpng input.dot > output.png

This will generate a nice set-enumeration tree:

set-enumeration tree of {a,b,c,d,e}

A few more set-enumeration trees

For your convenience, I have also generated a few more commonly used set-enumeration trees, so that they can be used directly without running the Java program and GraphViz:

set-enumeration tree of {a}
set-enumeration tree of {a,b}
set-enumeration tree of {a,b,c}
set-enumeration tree of {a,b,c,d}
set-enumeration tree of {a,b,c,d,e}
set-enumeration tree of {a,b,c,d,e,f}
set-enumeration tree of {a,b,c,d,e,f,g}


Hope that you have enjoyed this post.  If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.


How to give a good oral presentation at an academic conference?

Today, I will discuss the topic of how to give a good oral presentation of your research at academic conferences. This is a very important topic since if a presentation is interesting, more researchers will be interested in your work, and it may bring several opportunities such as collaborations with other researchers. So let’s get started!

The key points for giving a good presentation are the following:

1) Prepare your presentation early

The first key point is to prepare your presentation well in advance of the date of the presentation. In other words, you should not create your PowerPoint the night before. Preparing your presentation in advance will let you avoid underestimating the required time for preparation and for rehearsing, and thus ensure that you have a good sleep before the presentation. If your presentation is ready a few days before the date, it will also give you more time to rehearse your presentation, which leads us to the next point.

2) Rehearsal is always needed

One should always rehearse before giving a presentation, although the amount of time necessary may vary. Typically, I will rehearse my presentation at least a few times in the hotel room the day before giving a presentation at a conference. But depending on how much experience you have, it may be safer to rehearse a few days before the date. If you are new to giving academic presentations, it is recommended to first rehearse a few times by yourself and then to rehearse at least once in front of your research advisor or your colleagues. You may also practice answering questions with your colleagues, because most academic presentations are followed by questions from the audience.

If your native language is not English, it is important to rehearse even more. This will help you feel more comfortable speaking in English and it will allow you to avoid searching for your words and hesitating during the live presentation. Besides, you may practice pronouncing some difficult words before the presentation. I have seen a few times that a presenter did not know how to pronounce a word or pronounced it incorrectly, especially for math terms.

3) Know your audience 

Another important point is to know who your audience is. Are you presenting in front of experts in your very specific field or a crowd of researchers that may not be very familiar with your topic? For example, if you are working on data mining algorithms, it is not the same to present your work at a data mining conference where everyone is familiar with at least basic data mining concepts, at an artificial intelligence conference, or at a general computer science conference. It is important to know your audience because, in general, any kind of oral presentation should be adapted to its audience. For example, if you are sure that the audience is familiar with some basic concepts in your field, you do not need to explain them. Or if you think they are not familiar with some topics, you may give further explanations.

4) Define your goal, and organize your presentation well

A related point is to clearly know what the goal of your presentation is. What do you want the audience to understand and remember about your work? If you had an infinite amount of time, you could describe your research in very deep detail. However, in reality there is always a time limit. For example, at a conference, organizers may give you 15 minutes to present your research. So what should you present during these 15 minutes? Personally, my goal in such a conference presentation is to give an overview of the research. I want the audience to clearly understand the problem that I want to solve, then I give a rough overview of the technical solution with emphasis on what is new, but without going into deep technical details or mathematical formulas. And then, I present the experimental results and the conclusion. My goal is to make the audience understand the problem, the main originalities and the results. If they want to read the deep technical details, proofs of theorems, etc., they may read the conference paper, ask me questions or discuss with me after the presentation for more details. I have often seen researchers showing slides full of mathematical formulas with many different symbols. In some fields, it may be OK to do that. But in other fields, you may lose your audience if you give too many technical details like that.

A good organization for your presentation is important. To present a paper at a conference, the presentation may roughly follow the sections of your paper. The key parts are:

  • Introduction, related work, problem statement: introduce yourself, the topic, and then slowly introduce related work, its limitations and then the problem that you want to solve. You may show an example of the problem that you want to solve.
  • Your solution: explain how you have solved the problem.
  • Experimental validation: this shows that your solution effectively solves the research problem.
  • Conclusion: a good conclusion should summarize the key points of your presentation (problem, key characteristics of your solution, key experimental results), and then you may talk about limitations and future work. The conclusion is very important. You can think of the conclusion as the set of key points that you want your audience to remember.

5) Test your presentation and prepare solutions for technical problems

Preferably, you should test your presentation in the environment where you will present, before the presentation, to avoid issues such as software incompatibilities. Some common problems are:

  • fonts in your Powerpoint presentation are not available on the provided computer (you may solve this problem by embedding them).
  • videos in your presentation are not playing properly (e.g. problems with codecs, etc. – generally it may be better to avoid using videos),
  • the provided computer uses different software such as OpenOffice and your presentation doesn’t look the same (a solution is to bring a PDF version of your presentation with you, just in case),
  • you may need a special cable to plug your laptop into the video projector (e.g. for Apple computers).

Personally, I bring my laptop and a USB stick with a PowerPoint and a PDF version of my presentation, just in case. I may also send a copy of my presentation to my e-mail. Also, I try to arrive a little bit earlier to be able to test my presentation on the provided computer if I cannot use my laptop.

6) Make your presentation interesting and interact with your audience

I have seen many boring presentations at academic conferences, not because the topic was boring but because of a poor presentation. Here are some key points for making a good presentation:

  • Arrive early at the presentation. Just before starting your presentation, look at the audience and smile at them. This will help you get more comfortable and kill the nervousness. Also breathe well. If you are well prepared, everything will go well.
  • Talk to your audience. Do not read your presentation. Otherwise, the audience will quickly lose interest. If you are using PowerPoint, you should not put very long sentences in your slides. You should instead only write some very short lines and use them as cues to remember where you are in your presentation rather than as a text to be read. The text may be partial sentences rather than full sentences.
  • Always look at your audience. Look around and make eye contact with the audience as you keep talking, to keep them interested. Do not always look at the same person (e.g. the organizer) but look at the whole audience. You may sometimes give a quick look at your laptop to see which slide you are at in your presentation.
  • Never turn your back to the audience. Turning your back to the audience is one of the worst things that a presenter can do. In some cases, it may be OK to turn your back for a few seconds to briefly show something using a laser pointer, but never do it for more than a few seconds, and preferably use the mouse instead of turning your back.
  • Avoid giving a presentation while sitting. Some people like to sit to give a presentation. But standing will result in a better presentation: you can move more freely and people will see you better if you stand up than if you are sitting. Also stand straight and firm. Have confidence. Avoid unnecessary movements that can annoy the audience, such as playing with a pen in your hand while talking. Make sure that you are not standing in front of the screen so that people can see your slides.
  • Avoid using overly complicated words or too large a vocabulary. Try to explain your research using simple words. This will allow everyone to understand better, especially non-native speakers.
  • Don’t speak too fast. The goal is not to explain your whole paper in a short amount of time, but to give the main idea about your research. If people are interested in your research, they may talk with you after the presentation or read your paper. So carefully select the content beforehand so that you don’t have to speak very fast. Usually, as a rough estimate, I prepare 1 PowerPoint slide for each minute of my presentation (but I also rehearse to check that the time is fine).
  • You may use some techniques such as changing the tone of your voice to emphasize some words and break the monotony. You may also consider telling a short story in the introduction to raise interest in your problem.
  • Analyze the reaction of your audience. If they look like they are not understanding, you may need to give a few more explanations. If you are not sure, you may ask if everyone understood your explanation (e.g. “is that clear?”). If you feel that they are bored because the explanations are too simple, maybe you can skip some basic explanations.
  • Speak loud enough.

7) Other advice

A few more pieces of advice:

  • Keep your presentation simple from a technical point of view. It may be tempting to use all the features offered in PowerPoint such as animations, transitions, sound effects, videos, etc. But most of these will just distract your audience and should thus be avoided. Sometimes, animations may be helpful. But only use them when they are really necessary to help understanding.
  • Take the time to design your PowerPoint slides well. For example, rather than just copying and pasting illustrations from your paper into your slides, you may adapt them so that they look better in a presentation. You should avoid putting too much text on a slide or too many bullet points. Slides should also have a clear title. Do not use too many different fonts or colors. Adding some pictures may make your presentation more interesting. But use them well, when they help understanding, and do not put too many of them.
  • It is great to spend time creating a great PowerPoint. But rehearsing is just as important. Therefore, avoid the mistake of spending too much time creating a great PowerPoint but not rehearsing enough.
  • After a presentation, the audience is generally given a few minutes to ask you some questions. Make sure that you understand a question well before answering it. If someone asks you a question and you don’t understand it, you may ask them to repeat it. Besides, if a question is too complicated or you cannot answer it on the spot, you may simply invite the person to come talk to you later (e.g. during the coffee break) to discuss it.

There is obviously much more that could be said on this topic. If you present your work for the first time at an academic conference, it is normal to be nervous. But the solution is to prepare yourself well. If you are well prepared, everything should be fine.

—–

That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.


Brief report about the ADMA 2014 conference

In this blog post, I will discuss my recent trip to the ADMA 2014 conference (10th International Conference on Advanced Data Mining and Applications), held December 19-21, 2014 in Guilin, China. Note that the views expressed in this post are my personal opinion/interpretation of what I attended at this conference.

Overall impression

There were many interesting paper presentations this year. The conference had a day of workshops and two days of regular conference papers. Overall, the conference was very enjoyable since it is focused on data mining. The location of the conference was also very good, in Guilin, reportedly one of the most beautiful places in China.

Guilin, China (pictures obtained on the ADMA website)

Impact of the ADMA conference

ADMA is one of the top data mining conferences in Asia. It is certainly not as popular as PAKDD. But it was said at the opening ceremony that in terms of impact (measured by average citations per paper), ADMA has had more impact than PAKDD during the last 5 or 10 years, which is surprising but also great for this conference. It was suggested that one of the reasons for this high impact may be that ADMA also accepts applied data mining papers (hence the name ADMA: Advanced Data Mining and Applications), while these papers may be less welcome in some other data mining conferences.

Keynote speech by Osmar Zaiane about Big Data (slides)

There was a very interesting keynote speech given by O. Z. at the conference about big data. I will try to report here the main ideas.

First, O.Z. said that he doesn’t really like the term “big data” (he prefers “rich data”). By looking at Google Trends, he showed that the term “big data” may just be a convenient replacement term for “data mining”. In Google Trends, we can see that as the interest in “data mining” decreases, the interest in “big data” increases. It was suggested that one of the reasons why the term “data mining” has lost some interest is the bad reputation of data mining in the media following the events of 9/11. O.Z. also compared big data with other terms such as “business intelligence”, saying that unlike “big data”, the term “business intelligence” never really took off, although it could have.

Google Trends: big data vs data mining

Several statistics were shown that indicate that there is currently a huge hype around big data. The market was evaluated at more than 14 billion $ USD in 2014 and, according to some surveys, most of the top companies in the US are now convinced that they MUST use big data to stay competitive. There is therefore a very huge HYPE around big data. It seems that everybody wants to do big data, but maybe it is not always for a good reason, or not everybody knows whether it is appropriate or not.

This hype can be compared with the huge hype that occurred around artificial intelligence in the 1970s and at the end of the 1980s. Following this hype, there were the AI winters, during which people lost interest in AI because too many expectations were built on the hype and these expectations could not be met. It was said that a kind of AI winter could happen in the near future for big data because of the huge hype currently happening. And I have to say that I agree with this.

The hype cycle of research and technology is known as the Gartner Hype Cycle. During the AI winters, although a lot of people lost interest in AI, there was always a core group of believers who continued to work on AI despite having less funding. These people kept AI research alive during the winters. About this observation, my opinion is that it is important to not just jump on the big data trend and forget other important topics of data mining because of the hype around big data. We certainly have not solved all the problems about “small data” yet.

Gartner Hype Cycle

Another interesting point in this talk is a metaphor about an elephant and blind people, based on an old story from China/India. Imagine that several blind people are in a room touching an elephant. Different people may touch different parts of the elephant and draw different conclusions about what it is. But none of them can see the global picture. Companies may currently not see the big picture about big data and just jump on the hype because other companies say that it is important. For example, too many people focus on the size of data to define what big data is, while there are many other important characteristics of big data that should not be overlooked.

Elephant and blind men (Wikipedia)

Some important characteristics of big data are that (1) the data is complex, (2) we need to solve complex problems requiring machine learning, (3) we need to integrate various sources of data that may be heterogeneous and conflicting, (4) the data may have high velocity, that is, arrive at high speed (e.g. streams), so we may need to react very quickly to it to make decisions, and (5) we may need to assess the reliability of the data, etc.

Another challenge is that data does not provide meaning. Even if we have huge amounts of data, it does not mean that we can understand the meaning behind it. So we should always consider the context of the data (metadata) to interpret it properly. An interesting story to illustrate this point is the following. Consider an alien that comes to Earth and observes humans. The alien may conclude that humans have on average one breast and one testicle. However, this completely misses the point that there are two genders (male/female). Thus, as said above, data does not provide meaning. It is necessary to use critical thinking and context to understand data and perform data analytics.

Another very interesting example to illustrate this point is the “Anscombe quartet”. It is a set of four datasets proposed by Francis Anscombe. These datasets have exactly the same main statistical properties (mean of x = 9, variance of x = 11, mean of y = 7.5, correlation = 0.816, and the same linear regression line). However, these datasets are VERY different if we visualize them in a two-dimensional space, because of outliers (see picture below). Thus, this example shows that we also need to work on visualization techniques to better understand the data and not just focus on statistical properties that may be misleading.

Francis Anscombe quartet

Another important challenge about big data is “data integration”. An interesting story about this is the story of Netflix. As most of you probably know, Netflix is a big movie rental company in the USA. This company uses a recommender system to recommend movies to users. A few years ago, Netflix launched a contest for researchers. The goal was to beat the accuracy of their recommender system by 10% to win 1 million $. However, there was a constraint that participants had to use the provided dataset and could not use any external source of data. The result is that the solutions were very complex and have never been used in practice by Netflix, although some of them achieved the 10% improvement. The lesson from this story is that it would probably have been much easier to solve this problem by using external sources of information such as the IMDb movie database, and this could have provided simpler solutions. We should thus not be afraid to use different sources of data and perform data integration. Moreover, there are several research challenges with respect to data integration, such as how to deal with conflicting information from various sources of data, etc.

Another challenge related to big data is uncertainty. Consider a person that has an online profile that gives away his age and gender. This kind of information is factual. There is no uncertainty. However, consider that this same person buys a swimming watch. A data mining system could infer that this person is most likely a swimmer. However, it is possible that the person bought the swimming watch as a gift. Thus, in this kind of situation, we should also consider the uncertainty in the data when performing data analysis.

There was also a discussion about some pitfalls of big data. It was said that we should train more and more data scientists, and also that we should carefully train managers about what to expect from big data, so that appropriate tasks are given to data scientists and expectations remain reasonable.

Another pitfall is that it is tempting to look for a single solution that can be applied to everything. Currently, in big data, many people think that we need to use MapReduce or some very trendy technology such as deep learning. However, it is very difficult to apply MapReduce to some particular tasks. Thus, we should not focus on a single technology as the solution to every problem. An interesting analogy is someone who tries to use a screwdriver to fix everything, and even uses the screwdriver as a key to start his car. It may work, but it is not the most appropriate tool. Actually, we should build simpler tools that guide the user through data analytics without having to learn a lot of complex technologies; it should be as simple as pushing a button. We can already drive a car without understanding everything about how the car works, so why should it not be the same for data analytics? This will certainly require a lot of work, especially on data visualization.

In conclusion, yes, there is a lot of hype about big data. But there is also a real need, because there is more and more data and we need to use it to make decisions.

Keynote speech by Hui Xiong about Big Data and Mobile Recommender System  (slides)

A great keynote speech was also given by Hui Xiong. I will try to report some of the ideas that I found interesting in this keynote.

It was emphasized that key aspects of big data are timely observation, timely analysis and timely solutions. What this means is that we sometimes have huge amounts of data, but we need to make observations and analyze them very quickly because of time constraints (for example, the user does not want to wait a month to get the result!).

An interesting analogy about big data is to compare it to fishing in a big river. When a human is fishing in a very small river, it is relatively easy to see the fish since the water is clear, and it is also easy to catch them. However, in a larger river such as the Yangtze river in China, we cannot see the fish anymore and we don't know which fish to catch. This is the situation of big data.

How to handle big data? It was said that some key points about handling big data are the following. First, we should understand the characteristics of the data. Second, we should carefully select the features and attributes that are relevant for the analysis that we want to perform; we may have a petabyte of data, but perhaps only a small part of it is relevant. Third, we should also carefully select the instances in the data to be used for performing the analysis.

Also, it is important to understand business goals to find good research problems. It was reported that Hui Xiong only works with real data for research, such as taxi data or travel data.

Now, for mobile recommender systems, there are several challenges. First, time and cost constraints are important: for example, for a travel recommender system, how much time does the user have for travelling? How much can he pay for a travel package? Another challenge is that, unlike typical recommendation approaches such as collaborative filtering, in mobile environments we often need to consider the time dimension or the sequence of events. An example given to illustrate this point is the work of Hui Xiong on taxi path recommendation (KDD 2009). The goal of that paper was to recommend sequences of locations to the user rather than performing a single recommendation. Other challenges are that the data is not uniformly distributed and that it may be heterogeneous.

There were many other points in this talk, but those are the key points that I noted.

Next year's conference

In the opening ceremony of the conference, it was announced that the ADMA 2015 conference will be held in Shanghai, and that ADMA 2016 will be held in Macau.

Best paper awards

Three best paper awards were given at this conference.  Two of them are:

  • Fournier-Viger, P., Wu, C.W., Tseng, V.S. (2014). Novel Concise Representations of High Utility Itemsets using Generator Patterns. Proc. 10th International Conference on Advanced Data Mining and Applications (ADMA 2014), Springer LNCS 8933, pp. 30-43.
                                   (this is my paper! – I’m very happy about that)
  • Strecht, P., Mendes-Moreira, J., Soares, C. (2014). Merging Decision Trees: A Case Study in Predicting Student Performance. Proc. 10th International Conference on Advanced Data Mining and Applications (ADMA 2014), Springer LNCS 8933, pp. 30-43.

I don't remember the third one. Besides, a 10-year most influential paper award should be announced soon on the website of the conference, because the winner was unable to attend the conference.

==

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 80 data mining algorithms.

If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Posted in Conference, Data Mining | 6 Comments

Drawing the Powerset of a Set using Java and GraphViz (Hasse Diagram)

In this blog post, I will explain and provide source code to automatically draw the powerset of a set using Java and GraphViz. Drawing a powerset is useful in mathematics and also in computer science; for example, in frequent itemset mining, it can be used to visualize relationships between itemsets.

What is the powerset of a set?

The powerset of a set S can be simply defined as the set of all subsets of S (including the empty set and S itself).

For example, the powerset of the set {1,2,3} is the set { {}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3} }.  It can easily be shown that the size of the powerset of a set containing n items is 2^n.
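As a quick sanity check of the 2^n claim, here is a minimal Java sketch (the class name PowersetSize is just for illustration) that enumerates the subsets of {1,2,3} with the same bit-manipulation trick used by the program of Step 1 below, and counts them:

public class PowersetSize {

	public static void main(String[] args) {
		int[] set = { 1, 2, 3 };
		int count = 0;
		// each integer from 0 to 2^n - 1 encodes one subset through its bits
		for (int mask = 0; mask < (1 << set.length); mask++) {
			StringBuilder subset = new StringBuilder("{");
			for (int j = 0; j < set.length; j++) {
				// the j-th bit of the mask tells whether set[j] is in the subset
				if ((mask & (1 << j)) != 0) {
					if (subset.length() > 1) {
						subset.append(",");
					}
					subset.append(set[j]);
				}
			}
			subset.append("}");
			System.out.println(subset);
			count++;
		}
		// prints "8 subsets = 2^3"
		System.out.println(count + " subsets = 2^" + set.length);
	}
}

Note that the subsets are printed in the order of the binary counter, not by levels of the Hasse diagram.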

How can we draw a powerset?

A powerset is often represented as a Hasse Diagram. For the purpose of drawing powersets, a Hasse Diagram can be defined as a diagram where:

  • each vertex is a set from the powerset
  • there is an edge from a set X to a set Y iff  X ⊂ Y and there does not exist a set Z such that X ⊂ Z ⊂ Y

For example, the Hasse diagram of the powerset of {1,2,3} is:

Hasse diagram of the powerset of {1,2,3}

Now, I will show how to automatically draw a nice diagram such as the one above.

Step 1.  Generate a GraphViz input file

The first step is to have Java installed on your computer and to use the following Java program to generate a GraphViz input file named "input.dot" for the powerset of {a,b,c,d,e}. Note that you can edit the line String[] set = new String[] { "a", "b", "c", "d", "e" } to draw the powerset of another set. Note also that the Java code below is not fully optimized, but for the purpose of drawing powersets, it is fine.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * @author Philippe Fournier-Viger, 2014
 */
public class DrawPowerSet {

	public static void main(String[] arg) throws IOException,
			InterruptedException {

		// This is the set of items for which we want to draw the powerset
		String[] set = new String[] { "a", "b", "c", "d", "e" };

		// output file
		String output = "input.dot";

		// create the file
		BufferedWriter writer = new BufferedWriter(new FileWriter(output));
		writer.write("digraph mygraph{");

		// We will enumerate all the subset
		for (long i = 0, max = 1 << set.length; i < max; i++) {
			// we create a new subset
			List<String> newSet = new ArrayList<String>();
			for (int j = 0; j < set.length; j++) {
				// check if the j-th bit of i is set to 1
				int isSet = (int) i & (1 << j);
				if (isSet > 0) {
					// if yes, add it to the set
					newSet.add(set[j]);
				}
			}
			// For the new subset, print links to all supersets
			if (newSet.size() != set.length) {
				printLinksToImmediateSupersets(newSet, set, writer);
			}
		}
		// write end of file
		writer.write("}");
		writer.close();
	}

	/**
	 * This method prints links from a subset to all its immediate supersets (not
	 * optimized).
	 * 
	 * @param subset
	 *            the subset
	 * @param set
	 *            the set of all items
	 * @param writer
	 *            object to write to the output file
	 * @throws IOException
	 */
	private static void printLinksToImmediateSupersets(List<String> subset,
			String[] set, BufferedWriter writer) throws IOException {
		// For each item in the set of all items
		for (int i = 0; i < set.length; i++) {
			String value = set[i];
			// if it is not contained in the subset
			if (subset.contains(value) == false) {
				// we add it to the set to make an immediate superset
				// and write the link
				List<String> newSet = new ArrayList<String>();
				newSet.addAll(subset);
				newSet.add(value);
				writer.write(asString(subset) + " -> " + asString(newSet)
						+ " \n");
			}
		}
	}

	/**
	 * Convert a set to a string representation
	 * 
	 * @param set
	 *            the set as a list of strings
	 * @return a string
	 */
	private static String asString(List<String> set) {
		Collections.sort(set);
		// if the empty set, we will write "{}"
		if (set.size() == 0) {
			return "\"{}\"";
		}
		// otherwise we will write the items of the set
		StringBuffer buffer = new StringBuffer();
		buffer.append("\"{");
		// for each integer
		for (int i = 0; i < set.size(); i++) {
			String value = set.get(i);
			buffer.append(value);
			if (i != set.size() - 1) {
				buffer.append(",");
			}
		}
		buffer.append("}\"");
		return buffer.toString();
	}

}

Running the above program will create a file called "input.dot" with content similar to the following, which represents the nodes of the graph to be drawn and the links between them.

digraph mygraph{"{}" -> "{a}" 
"{}" -> "{b}" 
"{}" -> "{c}" 
"{}" -> "{d}" 
"{}" -> "{e}" 
"{a}" -> "{a,b}" 
"{a}" -> "{a,c}" 

....

"{c,d,e}" -> "{a,c,d,e}" 
"{c,d,e}" -> "{b,c,d,e}" 
"{a,c,d,e}" -> "{a,b,c,d,e}" 
"{b,c,d,e}" -> "{a,b,c,d,e}" 
}

Step 2.  Generating a PNG file of the graph using GraphViz

Then, we can run GraphViz from the command line to generate the graph as a PNG file:

dot -Tpng input.dot > output.png

This will generate a nice Hasse Diagram:

Hasse Diagram of the powerset of {a,b,c,d,e}
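As a side note, if you prefer to automate Step 2 as well, GraphViz can be launched directly from the Java program. Here is a minimal sketch, assuming that the dot executable is installed and available on the PATH (the class name RunGraphViz is mine, and the file names match the ones used above):

import java.io.File;
import java.io.IOException;

public class RunGraphViz {

	public static void main(String[] args) throws IOException, InterruptedException {
		// assumes that "dot" (GraphViz) is on the PATH and that input.dot
		// has been generated in the current directory by the program of Step 1
		ProcessBuilder builder = new ProcessBuilder("dot", "-Tpng", "input.dot", "-o", "output.png");
		builder.directory(new File("."));
		builder.redirectErrorStream(true);

		// run GraphViz and wait until the PNG file has been written
		Process process = builder.start();
		int exitCode = process.waitFor();
		System.out.println("GraphViz finished with exit code " + exitCode);
	}
}

Here the -o option of dot is used instead of redirecting the standard output, which is simpler to do from Java.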


Hope that you have enjoyed this post.  If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.

Posted in Data Mining, General, Mathematics | 3 Comments

Big Problems only found in Big Data?

Today, I will discuss Big Data, which is a very popular topic nowadays. The popularity of big data can be seen, for example, in universities: many universities are currently searching for professors who do research on "big data". Moreover, all the big conferences on data mining have workshops, tutorials and keynote speeches on "big data". Besides, many researchers try to label their research as "big data" or try to find some "big data" projects to get more funding.


What is big data?  There have been many attempts to define what big data is. One key aspect is that there should be a huge amount of data (e.g. terabytes) that cannot be handled by a single computer. Other aspects that have been proposed to define big data are that the data is heterogeneous (e.g. it combines many different types of data), that it is evolving (e.g. new data keeps arriving), and that it is decentralized and distributed.

In this post, I will argue that Small Data also has difficult problems that need to be solved. Actually, even in very small datasets, there exist some very difficult problems that we have not solved yet and that are more difficult than some big data problems. In general, I will call problems that are difficult to solve by an algorithm "Big Problems". These problems can be found either in Small Data or in Big Data.

An example of a Big Problem in Small Data

I will give an example of a Big Problem that can occur with just a few kilobytes of data.

In data mining, there is a popular task called frequent itemset mining (Agrawal et al., 1993). It is defined as follows. Consider a set of transactions, where each transaction is a set of symbols (items). For example, consider the three following transactions:

T1: {beer}, {egg}, {bread}
T2:  {beer}, {cheese}, {wine}, {bread}, {milk}
T3:  {cheese}, {wine}, {bread}

The goal of frequent itemset mining is to enumerate all subsets of items that are common to at least minsup transactions, where minsup is a threshold set by the user.  For example, if minsup = 2 transactions, the result would be {{beer}, {cheese}, {wine}, {bread}, {beer, bread}, {cheese, wine}, {cheese, bread}, {wine, bread}, {cheese, wine, bread}}.
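To make the task concrete, here is a minimal, unoptimized Java sketch of this enumeration on the three transactions above (the class name NaiveItemsetMining and the hard-coded data are just for illustration; real algorithms such as Apriori or FP-Growth prune the search space much more cleverly):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveItemsetMining {

	public static void main(String[] args) {
		// the three example transactions
		List<Set<String>> transactions = new ArrayList<Set<String>>();
		transactions.add(new HashSet<String>(Arrays.asList("beer", "egg", "bread")));
		transactions.add(new HashSet<String>(Arrays.asList("beer", "cheese", "wine", "bread", "milk")));
		transactions.add(new HashSet<String>(Arrays.asList("cheese", "wine", "bread")));
		int minsup = 2;

		// collect the distinct items appearing in the transactions
		List<String> items = new ArrayList<String>();
		for (Set<String> transaction : transactions) {
			for (String item : transaction) {
				if (!items.contains(item)) {
					items.add(item);
				}
			}
		}

		// enumerate every non-empty subset of the items (2^n - 1 candidates)
		for (int mask = 1; mask < (1 << items.size()); mask++) {
			Set<String> candidate = new HashSet<String>();
			for (int j = 0; j < items.size(); j++) {
				if ((mask & (1 << j)) != 0) {
					candidate.add(items.get(j));
				}
			}
			// count in how many transactions the candidate appears
			int support = 0;
			for (Set<String> transaction : transactions) {
				if (transaction.containsAll(candidate)) {
					support++;
				}
			}
			if (support >= minsup) {
				System.out.println(candidate + "  support: " + support);
			}
		}
	}
}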

This problem is an enumeration problem and has many harder variations, for example when quantities and weights are considered (e.g. high utility itemset mining). However, even for the basic version of this problem, the search space can be huge, even for very small datasets. I will give a simple example. Consider five transactions such that three of them contain 100 common symbols (items). If we set minsup = 2 transactions, there will be at least 2^100 - 1 groups of items to output. In this case, no algorithm would terminate, or even if it did, it would run out of disk space or memory to save the result. This example shows that even with such a small dataset, we can find very difficult problems (Big Problems).
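Just to illustrate the scale of that number, the count 2^100 - 1 can be computed exactly with a few lines of Java (a throwaway sketch):

import java.math.BigInteger;

public class SubsetCount {

	public static void main(String[] args) {
		// number of non-empty subsets of a set of 100 common items: 2^100 - 1
		BigInteger count = BigInteger.ONE.shiftLeft(100).subtract(BigInteger.ONE);
		// prints 1267650600228229401496703205375, i.e. about 1.27 x 10^30 itemsets
		System.out.println(count);
	}
}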

An example of an Easy Problem in Big Data

On the other hand, there are many Big Data problems that are easy to solve. For example, consider calculating the average sale price of the items sold in a retail store during the evening. Calculating the average price of a set of customer transactions can easily be parallelized, and there is no major challenge in doing so.
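As an illustration of how trivially such an aggregation decomposes, here is a small Java sketch (the prices are made up); the same partial-sum-and-count pattern is what makes this kind of computation easy to distribute with MapReduce-style frameworks:

import java.util.Arrays;
import java.util.List;

public class AverageSalePrice {

	public static void main(String[] args) {
		// hypothetical sale prices; in a real deployment they would be
		// partitioned across many machines
		List<Double> prices = Arrays.asList(3.99, 12.50, 7.25, 1.99, 45.00, 8.75);

		// averaging decomposes into independent partial sums and counts, so a
		// parallel stream (or a cluster) computes it without any difficulty
		double average = prices.parallelStream()
				.mapToDouble(Double::doubleValue)
				.average()
				.orElse(0.0);

		System.out.println("Average sale price: " + average);
	}
}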

Conclusion

So, in conclusion, what I want to highlight in this post is that although the amount of data plays a role in the difficulty of solving a problem, we should perhaps not put too much emphasis on this aspect alone. The reason is that it is not solely the amount of data that makes a problem difficult; it is also the type of problem and the search space that is involved. For example, making a program that can play Go and beat the best human players is extremely difficult even though it does not involve Big Data. We should also remember that we have not solved all the Big Problems in Small Data.

—–

That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.


Posted in artificial intelligence, Data Mining, General, Programming | Leave a comment

Report of the PAKDD 2014 conference (part 3)

This post continues my report of the PAKDD 2014 conference in Tainan (Taiwan).


The panel about big data

On Friday, there was a great panel at PAKDD about big data, with seven top researchers from the field of data mining. I will try to faithfully report some interesting opinions and ideas heard during the panel. Of course, the text below is my interpretation.

Learning from large data

Geoff Webb discussed the challenges of learning from large quantities of data. He mentioned that the majority of research focuses on how to scale up existing algorithms rather than on designing new algorithms. He also mentioned that different algorithms have different learning curves: some models may work very well with small data, while other models may work better with big data. Actually, models that can fit complex and large amounts of data may tend to overfit small data.

In his opinion, we should not just try to scale up the state-of-the-art algorithms, but design new algorithms that can cope with huge quantities of data, high dimensionality and fine-grained data. We need low-bias, very efficient and probably out-of-core algorithms.

Another interesting point is that there is a popular myth that any algorithm will work well if we can train it with big data. That is not true. Different algorithms have different learning curves (they produce different error rates depending on the size of the training data).

Big data and the small footprint

Another interesting opinion was given by Edward Chang. It was mentioned that simple methods can often outperform complex classifiers when the number of training examples is large. He mentioned that complex algorithms are hard to parallelize and that the solution may thus be to use simple algorithms for big data. As an example, he mentioned that he tried to parallelize "deep learning" algorithms for two years and failed because they are too complex.

Another key idea is that doing data mining on big data should have a small footprint in terms of memory and power consumption. The latter point is especially important for wearable computers, although of course some of the processing can be done in the cloud.

Should we focus on the small data problems?

Another very interesting point of view was presented by George Karypis. We are told that big data is everywhere and that there is more and more data. The community responded by proposing technologies such as MapReduce, linear models, deep learning, sampling, sub-linear algorithms, etc. However, we should stop spending time on big data problems relevant to only a few companies (e.g. Google, Microsoft).

We should rather focus on "deep data": data that may be small but is highly complex, computationally expensive to analyze and requires a "deep" understanding, yet can easily fit on today's workstations and small-scale clusters.

We should focus on applications that are useful rather than concentrating too much work on big data.

On the need to cross disciplines

Another refreshing point of view was the one of Shonali Krishnaswamy.

She mentioned that data mining on mobile platforms may be hard due to complex computations, limited resources and users with a short attention span.

Moreover, to be able to perform data mining on big data, we will need to cross disciplines by drawing inspiration from the fields of: (1) parallel/distributed algorithms, (2) mobile/pervasive computing, (3) interfaces/visualization, (4) decision sciences and (5) perhaps semantic agents.

Issues in healthcare

There was also some discussion about issues in healthcare by Jiming Liu. I will not go into too much detail about this one since it is not close to my own topic. But some of the challenges that were mentioned are: how to deal with diversity, complexity, timeliness, diverse data sources, the spatio-temporal scales of the problem, complex interactions and structural biases; how to perform data-driven modelling; how to test results and services; and how to access and share data.

Coupling

There was also another discussion by Longbing Cao about the need for coupling. I did not take many notes about this one, so I will not discuss it here.

Continue reading my PAKDD 2014 report (part 2) here

—–

That is all I wanted to write for now about this edition of PAKDD. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Update: You may also want to read my posts about PAKDD 2015, PAKDD 2017, PAKDD 2018, PAKDD 2019, PAKDD 2020 and PAKDD 2024.

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining library SPMF, offering more than 65 data mining algorithms.

Posted in Conference, Data Mining | 1 Comment