Today, I will discuss an important concept in data mining which is the use of constraints.
Data mining is a broad field incorporating many different kind of techniques for discovering unexpected and new knowledge from data. Some main data mining tasks are: (1) clustering, (2) pattern mining, (3) classification and (4) outlier detection.
Each of these main data mining tasks offers a set of popular algorithms. Generally, the most popular algorithms are defined to handle a general and simple case that can be applied in many domains.
For example, consider the task of sequential pattern mining proposed by Agrawal and Srikant (1995). Without going into details, it consists of discovering subsequences that appear frequently in a set of sequences of symbols. In the original problem definition, a user only has two parameters: (1) a set of sequences and (2) a minimum frequency threshold indicating the minimal frequency that a pattern should have, to be found.
But to apply a data mining algorithm in a real application often require to consider specific characteristics of the application. One way to do that is to add the concept of constraints. For example, in the past, I have done a research project where I have applied a sequential pattern mining algorithm to discover frequent sequences of actions performed by learners using an e-learning system (pdf). I first used a popular classical algorithm named PrefixSpan but I quickly found that the patterns found were uninteresting because. To filter uninteresting patterns, I have modified the algorithm to add several constraints such as:
– the minimum/maximum length of a pattern in sequences where timestamps are used
– the minimum “gap” between two elements of a subsequence
– removing redundancy in results
– adding the notion of annotations and context to sequences
By modifying the original algorithm to add constraints specific to the application domain, I got much better results (and for this work on e-learning, I received the best paper award at MICAI 2008). The lesson from this example is that it is often necessary to adapt existing algorithms by adding constraints or other domain specific ideas to get good results that are tailored to an application domain. In general, it is a good idea to start with a classical algorithm to see how it works and its limitations. Then, one can modify the algorithm or look for some existing modifications that are better suited for the application.
Lastly, another important point is for data mining programmers. There is two ways to integrate constraints in data mining algorithms. First, it is possible to add constraints by performing post-processing on the result of a data mining algorithm. The advantage is that it is easy to implement. Second, it is possible to add constraints directly in the mining algorithms so as to use the constraints to prune the search space and improve the efficiency of the algorithms. This is more difficult to do, but it can provide much better performance in some cases. For example, in most frequent pattern mining algorithms for example, it is well-known that using constraints can greatly increase the efficiency in terms of runtime and memory usage while greatly reducing the number of patterns found.
That is what I wanted to write for today. If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!
P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.