Today, I write a post to announce a new version of the SPMF Java open-source data mining library. It is SPMF version 0.95 and it is a major revision. It offers 11 new data mining algorithms for various data mining tasks.
The list of the new algorithms is as follows:
- TKS for top-k sequential pattern mining
- TSP for top-k sequential pattern mining
- VMSP for maximal sequential pattern mining
- MaxSP for maximal sequential pattern mining
- estDec algorithm for mining recent frequent itemsets from a stream (by Azadeh Soltani)
- MEIT (Memory Efficient Itemset-Tree), a data structure for targeted association rule mining
- CM-SPAM for sequential pattern mining
- CM-SPADE for sequential pattern mining
- CM-CLaSP for closed sequential pattern mining
- PASCAL for mining frequent itemsets and identifying generators
Below, I will briefly explain the importance of these new algorithms.
Algorithms for maximal sequential pattern mining
Mining sequential patterns can often produce a lot of sequential patterns. To address this issue, several algorithms have been propoved to mine less but more representative sequential patterns. Some popular subsets of sequential patterns are closed and maximal sequential patterns. In previous version of SPMF there was algorithms for closed sequential pattern mining but no algorithms for maximal sequential pattern mining. Now, two recent state-of-the art algorithms have been added for this task : VMSP (2014) and MaxSP (2013).
It is interesting to mine maximal sequential patterns because it can produce up to an order of magnitude less patterns than closed sequential patterns.
Faster algorithms for sequential pattern mining
The new version also offers faster algorithms for sequential pattern mining and closed sequential pattern mining.
CM-SPADE (2014) outperform all the previous algorithms in SPMF in most cases. CM-SPAM (2014) offers most of the time the second best performance.
Furthermore, the CM-ClaSP algorithm (2014) outperforms the CloSpan, ClaSP and BIDE+ algorithms for closed sequential pattern mining.
You can see a performance comparison below on six public datasets to see the speed improvement from these new algorithms (more details on the “performance” page of the SPMF website).
New algorithm for mining frequent itemsets from a stream
A new algorithm for stream mining is also introduced. It is the estDec algorithm (2003). It is a classical algorithm for mining frequent itemsets from stream that put more importance on recent transactions. Previously, there was only one stream mining algorithm in SPMF named CloStream. But CloStream has some important limitations: (1) it does not use a minimum support threshold therefore it can find a huge amount of patterns and (2) it put as much importance on old transactions as on recent transactions from the stream. estDec adresses these issues by using a threshold and putting less importance on older transactions.
New algorithm for frequent itemset mining
An implementation of the PASCAL algorithm is introduced for frequent itemset mining. PASCAL is a classical algorithm based on Apriori. The main improvement in PASCAL over Apriori is that it can correctly guess the support of some itemsets directly without having to scan the database. This is done based on the following property from lattice theory: when an itemset is not a “generator”, its support is the minimum support of its subsets. I will not define the concept of generator here. If you are curious, you can have a look at the example for Pascal in the documentation of SPMF for more details.
New algorithm for targeted itemset and association rule mining
A new data structure is also introduced named the Memory Efficient Itemset Tree (MEIT). This structure is an improvement of the Itemset-Tree structure that can use up to 45 % less memory but is a little bit slower (there is a trade-off between memory and speed). If you don’t know what is an Itemset-Tree, it is a structure for performing queries about itemsets and association rules. You can imagine for example a software that has to perform many queries such as “give me all the association rules with itemset X in the consequent”, “give me all the rules with item Z in the consequent”, etc. An Itemset-Tree is a structure that is optimized for answering such queries quickly and that can be updated incrementally by inserting new transactions. If you are curious about this, you can have a look at the example for Memory-ItemsetTree and Itemset Tree in the documentation of SPMF for more details.
Bug fixes and other improvements
Lastly, this new version also include a few bug fixes. One notable bug fix is for a bug in the ID3 implementation. I have also improved the documentation to explain clearly what are the input and output format of each algorithm. I have also updated the map of data mining algorithms offered in SPMF:
That is all I wanted to write for today! Hope that you will enjoy this new version of SPMF !
If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.