Today, I will do a quick post on how to automatically adjust the minimum support threshold of frequent pattern mining algorithms such as Apriori, FPGrowth and PrefixSpan according to the size of the data.
The problem is simple. Let’s consider the Apriori algorithm. The input of this algorithm is a transaction database and a parameter called minsup that is a value in [0,1]. The output of Apriori is a set of frequent patterns. A frequent pattern is a pattern such that its support is higher or equal to minsup. The support of a pattern (also called “frequency”) is the number of transactions that contains the pattern divided by the total number of transactions in the database. A key problem for algorithms like Apriori is how to choose a minsup value to find interesting patterns. There is no really easy way to determine the best minsup threshold. Usually, it is done by trial and error. Now let’s say that you have determined that for your application the best minsup.
Now, consider that the size of your data can change over time. In this case how can you dynamically adjust the minimum support so that when you don’t have a lot of data the threshold is higher? and that when you have more data, the threshold becomes lower ?
A simple solution that I have found is to use a mathematical function for adjusting the minsup threshold automatically according to the database size (the number of transactions for Apriori). This solution is shown below.
I choose to use the formula minsup(x) = (e^(-ax-b))+c where x is the number of transactions in the database and a,b,c are positive constants. This allows to set minsup to a high value when there is not a lot of data and then decrease minsup when there is more data. For example, on the first chart below, minsup value is set to 0.7 if there is 1 transaction, it becomes 0.4 when there is 3 transactions and then decrease to 0.2 when there is around 9 transactions. This allow the minsup threshold to become more strict when there is less data. Note that the constants a,b and c can be adjusted to make the curve behave differently.
On the second chart above, I show the relative minsup threshold. It is simply the minsup threshold multiplied by the database size. It shows the number of transactions in which a pattern need to appear to become frequent according to the database size.
What is the meaning of the constants a, b, and c?
The constant c is the smallest value that this function will produce. For example, if c = 0.2, the function will not generate minsup values that are less than 0.2. The constant a and b influences how quickly the curve will decrease when x increases.
How do we call this function?
In mathematics, this type of function is called an exponential decay function as it is exponentially decreasing when x increases. The idea of using this function for pattern mining was first proposed in my Ph.D thesis:
Fournier-Viger, P. (2010), Un modèle hybride pour le support à l’apprentissage dans les domaines procéduraux et mal-définis. Ph.D. Thesis, University of Quebec in Montreal, Montreal, Canada, 184 pages.
Hope this little trick is helpful!
Next time, I will try to talk about some more general data mining topic. I promised that I would do that last time. But today I wanted to talk about this topic, so I will rather do that next time!
Founder of the SPMF data mining software