The importance of constraints in data mining

Today, I will discuss an important concept in data mining: the use of constraints.

Data mining is a broad field incorporating many different kinds of techniques for discovering new and unexpected knowledge in data. Some of the main data mining tasks are: (1) clustering, (2) pattern mining, (3) classification and (4) outlier detection.

For each of these main data mining tasks, there is a set of popular algorithms. Generally, the most popular algorithms are designed to handle a simple, general case that can be applied in many domains.

For example, consider the task of sequential pattern mining proposed by Agrawal and Srikant (1995). Without going into details, it consists of discovering subsequences that appear frequently in a set of sequences of symbols. In the original problem definition, the user provides only two parameters: (1) a set of sequences and (2) a minimum frequency threshold, which is the minimum frequency that a pattern must have to be output.
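
To make this definition concrete, here is a minimal Python sketch of how the frequency (support) of a pattern can be counted. It is a simplification: each sequence is a plain list of symbols, and the names `is_subsequence`, `support` and `minsup`, as well as the toy data, are chosen for illustration only. It shows the problem definition, not how an efficient algorithm searches for patterns.

```python
def is_subsequence(pattern, sequence):
    """True if the symbols of `pattern` appear in `sequence` in the same
    order, not necessarily contiguously."""
    remaining = iter(sequence)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each symbol must be found after the previous one.
    return all(symbol in remaining for symbol in pattern)


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


# Toy database of four sequences (illustrative data).
database = [
    ["a", "b", "c", "d"],
    ["a", "c", "d"],
    ["b", "a", "d"],
    ["c", "b"],
]
minsup = 2  # minimum frequency threshold

for pattern in (["a", "d"], ["b", "c"]):
    sup = support(pattern, database)
    print(pattern, "support =", sup, "frequent =", sup >= minsup)
```

With these two parameters alone, every subsequence that reaches the threshold is a valid result, whether or not it is meaningful for the application.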

But applying a data mining algorithm in a real application often requires taking specific characteristics of the application into account. One way to do that is to add constraints. For example, in a past research project, I applied a sequential pattern mining algorithm to discover frequent sequences of actions performed by learners using an e-learning system. I first used a popular classical algorithm named PrefixSpan, but I quickly found that the patterns it produced were uninteresting. To filter out uninteresting patterns, I modified the algorithm to add several constraints such as:
– the minimum/maximum length of a pattern in sequences where timestamps are used
– the minimum “gap” between two elements of a subsequence
– removing redundancy in results
– adding the notion of annotations and context to sequences
– …
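
As an illustration of what such constraints can look like in code, here is a small Python sketch that checks a length constraint and a minimum gap constraint on one occurrence of a pattern. It rests on assumptions that are not from the original project: an occurrence is a list of `(timestamp, action)` pairs sorted by time, "length" is counted in number of elements, the gap is the time difference between consecutive elements, and the function name, parameter names and action names are hypothetical.

```python
def satisfies_constraints(occurrence, min_length=1, max_length=None, min_gap=0):
    """Check length and minimum-gap constraints on one pattern occurrence.

    occurrence: list of (timestamp, action) pairs sorted by timestamp.
    """
    length = len(occurrence)
    if length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    timestamps = [timestamp for timestamp, _ in occurrence]
    # Every pair of consecutive elements must be at least `min_gap` apart.
    return all(
        later - earlier >= min_gap
        for earlier, later in zip(timestamps, timestamps[1:])
    )


# Hypothetical learner actions with timestamps in seconds.
well_spaced = [(0, "open_exercise"), (5, "ask_hint"), (12, "submit")]
too_close = [(0, "open_exercise"), (1, "submit")]

print(satisfies_constraints(well_spaced, min_length=2, max_length=3, min_gap=3))
print(satisfies_constraints(too_close, min_length=2, max_length=3, min_gap=3))
```

A check like this can be applied to the output of an algorithm, or called during the search to discard occurrences early.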

By modifying the original algorithm to add constraints specific to the application domain, I got much better results (and for this work on e-learning, I received the best paper award at MICAI 2008). The lesson from this example is that it is often necessary to adapt existing algorithms by adding constraints or other domain-specific ideas to get good results that are tailored to an application domain. In general, it is a good idea to start with a classical algorithm to see how it works and what its limitations are. Then, one can modify the algorithm or look for existing variations that are better suited to the application.

Lastly, another important point is for data mining programmers. There are two ways to integrate constraints in data mining algorithms. First, it is possible to add constraints by post-processing the result of a data mining algorithm. The advantage is that it is easy to implement. Second, it is possible to add constraints directly in the mining algorithm, so that the constraints are used to prune the search space and improve the efficiency of the algorithm. This is more difficult to do, but it can provide much better performance in some cases. For example, for most frequent pattern mining algorithms, it is well known that using constraints can greatly improve efficiency in terms of runtime and memory usage while greatly reducing the number of patterns found.
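
The difference between the two approaches can be sketched in Python with a simplified level-wise (Apriori-style) frequent itemset miner and a maximum pattern size constraint. This is only an illustration under stated assumptions: the function `mine_frequent_itemsets`, its `max_size` parameter and the toy transactions are hypothetical, and the candidate generation is deliberately naive. A maximum size constraint is convenient here because any extension of an itemset that is too large is also too large, so the search can safely stop.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, minsup, max_size=None):
    """Level-wise search for frequent itemsets.

    If `max_size` is given, the constraint is pushed into the search:
    larger candidates are never generated or counted.
    Returns (dict itemset -> support, number of candidates counted).
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent, candidates_counted, size = {}, 0, 1
    while level and (max_size is None or size <= max_size):
        frequent_here = []
        for candidate in level:
            candidates_counted += 1
            sup = sum(1 for t in transactions if candidate <= t)
            if sup >= minsup:
                frequent[candidate] = sup
                frequent_here.append(candidate)
        # Join frequent itemsets of this size into candidates one item larger.
        size += 1
        level = list({a | b for a, b in combinations(frequent_here, 2)
                      if len(a | b) == size})
    return frequent, candidates_counted


transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"b", "d"}]

# Approach 1: mine everything, then filter the result (post-processing).
all_patterns, full_count = mine_frequent_itemsets(transactions, minsup=2)
post_processed = {p: s for p, s in all_patterns.items() if len(p) <= 2}

# Approach 2: push the constraint into the search.
pushed, pushed_count = mine_frequent_itemsets(transactions, minsup=2, max_size=2)

print(post_processed == pushed)   # both approaches keep the same patterns
print(full_count, pushed_count)   # candidates counted by each approach
```

Both approaches return the same patterns, but the second one counts fewer candidates because it never explores itemsets larger than the limit. On real data, where the number of candidates grows quickly with pattern size, this kind of pruning is where the runtime and memory savings come from.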

That is what I wanted to write for today. If you have additional thoughts, please share them in the comment section. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!


P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 50 data mining algorithms.

Comments

The importance of constraints in data mining — 6 Comments

  1. Hello, I am doing my dissertation and would like to ask some questions concerning data mining:

    1. the description of the existing system
    2. the logical analysis of existing system

    Thank you

    • I guess that description of the system means to write some text about the system. For “logical analysis” it must be some description of the components of the system or how it works. But I think that these requirements come from some professor at your university. It is probably best to ask about the exact requirements to make sure that you meet them.

  2. What are the main constraints considered when we want to do data mining? What are the models?
    I would appreciate your thoughts on constraints for high utility data, if any.

    • Hello, There are a lot of different kinds of constraints in data mining.
      For example, if you think about clustering algorithms, then there are some constraints such that some data points must be in the same cluster, or cannot be in the same cluster.
      Another example, if you think about itemset mining or pattern mining, then constraints can be that an itemset or pattern cannot contain more than X items.
      Also, in pattern mining, many other constraints are possible. For example, if you want to discover high utility patterns that are periodic, you can apply constraints related to the periodicity of a pattern, such as requiring that the pattern appears every week in your data (has a maximum periodicity of less than 1 week). As another example, if you want to find rare patterns, you could set a constraint saying that a pattern must not appear in more than 50% of the transactions in the database.
      Those are just a few examples. Many other types of constraints are used and could be used. It would be too long to list them all.

      • Thank you Sir.
        Sir, have you published any survey on high utility data mining, with or without constraints?

        Appreciate your thoughts.

        Regards,
        raju
