Today, I will do a quick post on how to automatically **adjust the minimum support threshold** of **frequent pattern mining algorithms** such as **Apriori, FPGrowth and PrefixSpan** **according to the size of the data.**

The problem is simple. Let’s consider the **Apriori** algorithm. The input of this algorithm is a **transaction database** and a parameter called **minsup **that is a value in** [0,1].** The output of Apriori is a set of frequent patterns. A **frequent pattern** is a pattern such that its **support** is higher or equal to **minsup**. The **support of a pattern** (also called “frequency”) is the number of transactions that contains the pattern divided by the total number of transactions in the database. A key problem for algorithms like **Apriori** is how to choose a minsup value to find interesting patterns. There is no really easy way to determine the best minsup threshold. Usually, it is done by trial and error. Now let’s say that you have determined that for your application the best **minsup**.

Now, consider that the size of your data can change over time. In this case how can you dynamically adjust the minimum support so that when you don’t have a lot of data the threshold is higher? and that when you have more data, the threshold becomes lower ?

**The solution**

A simple solution that I have found is to use a mathematical function for adjusting the minsup threshold automatically according to the database size (the number of transactions for **Apriori**). This solution is shown below.

I choose to use the formula * minsup(x) = (e^(-ax-b))+c * where

**is the number of transactions in the database and**

*x***are positive constants. This allows to set**

*a,b,c***to a high value when there is not a lot of data and then decrease**

*minsup***when there is more data. For example, on the first chart below,**

*minsup**value is set to 0.7 if there is 1 transaction, it becomes 0.4 when there is 3 transactions and then decrease to 0.2 when there is around 9 transactions. This allow the*

**minsup***threshold to become more strict when there is less data. Note that the constants*

**minsup****a**,

**b**and

**c**can be adjusted to make the curve behave differently.

On the second chart above, I show the **relative minsup threshold. **It is simply the minsup threshold multiplied by the database size. It shows the number of transactions in which a pattern need to appear to become frequent according to the database size.

**What is the meaning of the constants a, b, and c?**

The constant **c **is the smallest value that this function will produce. For example, if **c = 0.2**, the function will not generate ** minsup **values that are less than 0.2. The constant

**a**and

**b**influences how quickly the curve will decrease when

**increases.**

*x***How do we call this function?**

In mathematics, this type of function is called an *exponential decay function *as it is exponentially decreasing when *x *increases. The idea of using this function for pattern mining was first proposed in my Ph.D thesis:

Fournier-Viger, P. (2010), **Un modèle hybride pour le support à l’apprentissage dans les domaines procéduraux et mal-définis**. Ph.D. Thesis, University of Quebec in Montreal, Montreal, Canada, 184 pages.

**Conclusion**

Hope this little trick is helpful!

Next time, I will try to talk about some more general data mining topic. I promised that I would do that last time. But today I wanted to talk about this topic, so I will rather do that next time!

Philippe

Founder of the SPMF data mining software