Today, I will continue talking about pattern mining, and in particular about what makes a good pattern mining algorithm. Many algorithms for discovering patterns in data are not useful in real life. To design a good pattern mining algorithm, I will argue that it should ideally have some of the following desirable properties (not all are required):
- Can be used in many scenarios or applications: Some algorithms are designed for tasks that are not realistic or that make assumptions that do not hold in real life. It is important to design algorithms that can be used in real-life scenarios.
- Is flexible: An algorithm should ideally provide optional features so that it can be used in different situations where requirements are different.
- Has excellent performance: An algorithm should be efficient, especially if the goal is to analyze large datasets. In particular, it is desirable for an algorithm to scale linearly, or at least to scale well, so that it can handle big data. Efficiency can be measured in terms of runtime and memory usage.
- Has few parameters: An algorithm that has too many parameters is generally hard to use. However, it is OK to have optional parameters to provide more features to users. Some algorithms do not have any parameters at all. This is the case, for example, for skyline pattern mining algorithms and some compression-based pattern mining algorithms.
- Is interactive: Some systems for pattern mining provide interactive features, such as giving the user the ability to guide the search for patterns by providing feedback about the patterns that are discovered. Some systems also let the user perform targeted queries about specific items rather than finding all possible patterns.
- Offers visualization: Having visualization capabilities is also a useful feature to help users browse through numerous patterns.
- Can deal with complex data: It is also desirable to design algorithms that can deal with complex data types such as sequences and graphs as real-life data is often complex.
- Can discover statistically significant or correlated patterns: Many algorithms can find millions of patterns, but many of these patterns are spurious. In other words, some patterns may be weakly correlated or just appear by chance. To find significant patterns, a key approach is to use statistical tests or correlation measures.
- Lets the user select constraints to filter out patterns: For users, a key feature is to be able to set constraints on the patterns to be found, such as a length constraint, so as to reduce the number of patterns.
- Summarizes or compresses the data: Another important feature of a good pattern mining algorithm is the ability to find patterns that summarize or compress the data. In fact, rather than finding millions of patterns, it can be useful to find patterns that are representative in the sense that they capture the characteristics of the data well.
- Discovers pattern types that are interesting: The type of pattern that is discovered by an algorithm must be useful, or even surprising. It should not be too complex or too simple.
- Can find an approximate solution: Because exact algorithms for pattern mining are often slow or do not scale well, designing algorithms that return an approximate solution is also important.
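
To make a few of these properties more concrete (few parameters, optional features, and a length constraint), here is a minimal sketch of a level-wise frequent itemset miner in Python. It is only an illustration, not an implementation taken from SPMF or from any specific paper. The function name `mine_frequent_itemsets`, its parameters `min_support` and `max_length`, and the toy transactions are all assumptions made for this example.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, min_support, max_length=None):
    """Return {itemset: support count} for every itemset that appears in at
    least `min_support` transactions.

    `max_length` is an optional constraint (hypothetical name): when set,
    itemsets with more than `max_length` items are never generated.
    """
    if max_length is not None and max_length < 1:
        return {}
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the support of each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current and (max_length is None or k < max_length):
        k += 1
        # Join frequent (k-1)-itemsets to build candidate k-itemsets, and
        # keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count the support of each candidate by scanning the transactions.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
    return frequent


# Toy data, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# One required parameter: 3 single items and 3 pairs are found.
all_patterns = mine_frequent_itemsets(transactions, min_support=3)

# Optional length constraint: only the 3 single items are kept.
short_patterns = mine_frequent_itemsets(transactions, min_support=3, max_length=1)
```

The design choice here is that the only mandatory parameter is the minimum support, while the length constraint is optional. Also, the constraint is applied during the search rather than as a filter afterward, so longer candidates are never generated or counted.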
This is the list of properties that I think are the most important for pattern mining algorithms. I hope it has been interesting. Leave a comment below if you want to add something else. 🙂
Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.