In this blog post, I will give a brief overview of an important subfield of **data mining** that is called **pattern mining**. Pattern mining consists of using/developing **data mining algorithms** to discover interesting, unexpected and useful patterns in databases.

**Pattern mining algorithms** can be applied on **various types of data** such as transaction databases, sequence databases, streams, strings, spatial data, and graphs.

Pattern mining algorithms can be designed to discover **various types of patterns **such as subgraphs, associations, indirect associations, trends, periodic patterns, sequential rules, lattices, sequential patterns, and high-utility itemsets.

But what is an **interesting pattern**? There are several definitions. For example, some researchers define an interesting pattern as a pattern that appears *frequently *in a database. Other researchers wants to discover *rare patterns*, patterns with a high *confidence*, the top patterns, etc.

In the following, I will give examples of two types of patterns that can be discovered from a database.

**Example 1. Discovering frequent itemsets**

The most popular algorithm for pattern mining is without a doubt **Apriori** (1993). It is designed to be applied on a transaction database to discover patterns in transactions made by customers in stores. But it can also be applied in several other applications. A transaction is defined a set of distinct items (symbols). Apriori takes as input (1) a *minsup* threshold set by the user and (2) a transaction database containing a set of transactions. Apriori outputs all **frequent itemsets**, i.e. groups of items shared by no less than *minsup* transactions in the input database. For example, consider the following transaction database containing four transactions. Given a *minsup* of two transactions, frequent itemsets are “bread, butter”, “bread milk”, “bread”, “milk” and “butter”.

T1: bread, butter, spinach

T2: butter, salmon

T3: bread, milk, butter

T4: cereal, bread milk

**Fig. 1.** a transaction database

Apriori can also apply a post-processing step to generate “association rules” from frequent itemsets, which I will not discuss here.

The **Apriori algorithm** has given rise to multiple algorithms that address the same problem or variations of this problem such as to (1) incrementally discover frequent itemsets and associations , (2) to discover frequent subgraphs from a set of graphs, (3) to discover subsequences common to several sequences, etc.

*R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.*

**Example 2. Discovering sequential rules**

The second example that I will give is to **discover sequential rules** in a** sequence database**. A sequence database is defined as a set of sequences. A sequence is a list of transactions (as previously defined). For example in the left part of the following figure a sequence database containing four sequences is shown. The first sequence contains item *a* and *b* followed by *c*, followed by *f*, followed by *g*, followed by *e*. A **sequential rule** has the form X –> Y where X and Y are two distinct non empty sets of items. The meaning of a rule is that if the items of X appears in a sequence in any order, they will be followed by the items of Y in any order. The **support of a rule** is the number of sequence containing the rule divided by the total number of sequences. The **confidence of a rule** is the number of sequence containing the rule divided by the number of sequences containing its antecedent. The goal of **sequential rule mining** is to discover all **sequential rules** having a support and confidence no less than two thresholds given by the user named “minsup” and “minconf” . For example, on the right part of the following figure some sequential rules are shown for *minsup=0.5 *and* minconf=0.5,* discovered by the **RuleGrowth** **algorithm**.

For more details about sequential rule mining, this paper presents the RuleGrowth algorithm (I’m the author of that paper).

*Fournier-Viger, P., Nkambou, R. & Tseng, V. S. (2011). RuleGrowth: Mining Sequential Rules Common to Several Sequences by Pattern-Growth. Proceedings of the 26th Symposium on Applied Computing (ACM SAC 2011). ACM Press, pp. 954-959.*

Pattern mining is often viewed as techniques to explain the past by discovering patterns. However, patterns found can also be used for prediction. As an example of application, the following paper shows how sequential rules can be used for predicting the next webpages that will be visited by users on a website, with a higher accuracy than using sequential patterns (named “classic sequential rules” in that paper):

Fournier-Viger, P. Gueniche, T., Tseng, V.S. (2012). **Using Partially-Ordered Sequential Rules to Generate More Accurate Sequence Prediction**. Proc. 8th International Conference on Advanced Data Mining and Applications (ADMA 2012), Springer LNAI 7713, pp. 431-442.

**Implementations of pattern mining algorithms**

If you want to know more information about pattern mining, most of the general data mining books such as the book of Han & Kamber and Tan, Steinbach & Kumar have at least one chapter devoted to pattern mining.

If you want to test pattern mining algorithms, I recommend to have a look at the SPMF data mining library (I’m the project founder), which offers the Java source code of more than 120 pattern mining algorithms, with simple examples, and a simple command line and graphical user interface for quickly testing the algorithms.

**Conclusion**

Pattern mining is a subfield of data mining that has been active for more than 20 years, and is still very active. Pattern mining algorithms have a wide range of applications. For example, the Apriori algorithm can also be applied to optimized bitmap index of data wharehouse. In this blog post, I have given two examples to give a rough idea of what is the goal of pattern mining. However, note that it is not possible to summarize twenty years of research in a single blog post!

If you want to continue reading on this topic, you may read my survey on sequential pattern mining and my survey on itemset mining, which gives a good introduction to the topic of discovering frequent patterns in sequences (sequential patterns) and transaction databases.

Hope that this post was interesting! If you like this blog, you can subscribe to my RSS Feed or Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

–

P. Fournier-Viger is the founder of the Java open-source data mining software SPMF, offering more than 130 data mining algorithms.