Today, I will briefly explain what is a **generator itemset**. I will give some example and explain why **generator itemsets **are interesting and useful for some applications. I will also mention that efficient implementations can be found in the SPMF data mining library (which I am the founder).

## What is a generator itemset?

In data mining, a popular task is frequent itemset mining. It consists of finding **frequent itemsets** in a database. A **frequent itemset **is a set of values that frequently appear together in records of a database. The classical application of **itemset mining **is to analyze transaction databases to study the behavior of customers in a store. In that context, each record is a customer transaction, and the goal is to find sets of products bought together by at least *minsup* customers where *minsup *is a threshold set by the user.

For example, consider the following transaction database containing five transactions:

Transaction 1 | A, B, C, D

Transaction 2 | A, B, C, D

Transaction 3 | A, C

Transaction 4 | B, C

Transaction 5 | A

where **A, B, C, D** represent the purchase of **A**pple, **B**read, **C**ake, and **D**ates respectively. The **support of an itemset **is the number of transactions that contain it. For instance, the support of {A} is 4, the support of {B} is 3, and the support of {A, B} is 2. A frequent itemset is an itemset whose support is greater than or equal to a minimum support threshold. Let’s assume that the minimum support is 2. Then, the frequent itemsets are:

{A} support : 4

{B} support: 3

{C} support : 4

{D} suport : 2

{A, B} support : 2

{A, C} support : 3

{A, D} support: 2

{B, C} support : 3

{B, D} support: 2

{C, D} support: 2

{A, B, C} support: 2

{A, B, D} support: 2

{A, C, D} support: 2

{B, C, D} support: 2

{A, B, C, D} support: 2

Among these frequent itemsets, some are **generators** and some are not. A (frequent) **generator itemset** is a frequent itemset that does not have any subset with the same support. For example, {A,B,C,D} is not a generator because it has a subset {A} with the same support (2). However, {A, B} is a generator because none of its subsets ({A}, {B}) has the same support (2). The generator itemsets in this example are:

{A} support : 4

{B} support: 3

{C} support : 4

{D} suport : 2

{A, B} support : 2

{A, C} support : 3

As it can be observed, frequent generator itemsets are a subset of all frequent itemsets.

## Why are generator itemsets useful?

**Generator itemsets** (also called key itemsets) are useful for several reasons. First, they can be used to recover all the frequent itemsets by applying a closure operator. The closure operator takes a generator itemset and adds all the items that appear in every transaction containing that generator. For example, the closure of {A, B} is {A, B, C,D} because {A, B} appears in transactions 1 and 2, and both transactions also contain C and D. By applying the closure operator to all the generators, we can recover all the frequent itemsets.

Second, generator itemsets can help us understand the structure of the data. **Generator itemsets** represents the smallest sets of values that are common to some sets of transactions. In the context of shopping, it means that smallest sets of products that are common to some groups of customers. Generators are also often used for classification because they allow to extract the smallest sets of features that describe groups of instances (transactions).

Third, discovering generator itemsets can reduce the number of itemsets that are discovered. By mining generator itemsets instead of all frequent itemsets, we can find only the smallest itemsets that have a given support, which is more concise and can be more meaningful。

## How to mine generator itemsets?

There are several algorithms that can be used to mine generator itemsets from a transactional database. One of them is **Pascal**, which is based on the Apriori algorithm. Pascal works by generating candidate itemsets of increasing size and checking their support in the database. However, unlike Apriori, Pascal only generates candidates that are generators or potential generators. The key property used by Pascal to reduce the search space is that if an itemset is not a generator than all its supersets are also not generators.

## Efficient implementations of algorithms

If you want to try discovering generator itemsets and other kinds of patterns, the **SPMF open-source library** more than a hundred algorithms for mining itemsets and other patterns, including generator itemsets, frequent itemsets, and closed itemsets. SPMF is easy to use and provides a graphical user interface and a command line interface.

To use SPMF to mine generator itemsets from a transactional database, we need to follow these steps:

- Download and install SPMF from https://www.philippe-fournier-viger.com/spmf/
- Prepare an input file in the SPMF format (a text file where each line represents a transaction and items are separated by spaces)
- Choose an algorithm for mining generator itemsets (such as Pascal and DefMe) and set the parameters (the minimum support threshold).
- Run the algorithm and open the output text file.

For more details and examples on how to use SPMF to mine generator itemsets or other patterns from data, please refer to the documentation and tutorials on https://www.philippe-fournier-viger.com/spmf/

## A video

Lastly, if you are interested by this topic, I have also uploaded a video where I explain in more details what are **generator itemsets, closed itemsets** and **maximal itemsets**. This video is part of my free pattern mining course, and can be watched here: Maximal, Closed and Generator itemsets or from my youtube channel here.

## Conclusion

In this blog post, I have explained what is a **generator itemset **and why it is useful for data analysis. Hope that it was interesting. 😉

==**Philippe Fournier-Viger **is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Pingback: What is a maximal itemset? | The Data Blog