In this blog post, I will briefly discuss the fact that the popular CloSpan algorithm for frequent sequential pattern mining is an incomplete algorithm. This means that in some special situations, CloSpan does not produce the expected results that it has been designed for, and in particular some patterns are missing from the final result. I will try to give a brief explanation of that problem but I will refer to another paper for the actual proof that the algorithm is incomplete. Moreover, I will briefly talk about a similar problem in the IncSpan algorithm.
What is CloSpan?
CloSpan is one of the most famous algorithm for sequential pattern mining. It is designed for discovering subsequences that appear frequently in a set of sequence. CloSpan was published in 2003 in the famous SIAM Data Mining conference:
[1] Yan, X., Han, J., & Afshar, R. (2003, May). CloSpan: Mining: Closed sequential patterns in large datasets. In Proceedings of the 2003 SIAM international conference on data mining (pp. 166-177). Society for Industrial and Applied Mathematics.
The SIAM DM conference is a top conference. And the CloSpan paper has been cited in more than 940 papers, and used by many researchers.
What is CloSpan used for?
The CloSpan algorithm takes as input a parameter called minsup and a sequence database (a set of sequences). A sequence is an ordered list of itemsets. An itemset is a set of symbols. For example, consider the following sequence database, which contains three sequences:
Sequence 1: <(bread, apple),(milk, apple), (bread)>
Sequence 2: <(orange, apple),(milk, apple, bread), (bread, citrus)>
Sequence 3: <(bread, apple,citrus),(orange, apple), (bread)>
Here the first sequence (Sequence 1) means that a customer has bought the items bread and apple together, has then bought milk and apple together, and then has bought bread. Thus, this sequence contains three itemsets. The first itemset is bread with apple. The second itemset is milk with apple. The third itemset is bread.
In sequential pattern mining the goal is to find all subsequences that appears in at least minsup sequences of a database (all the sequential patterns). For example, consider the above database and that minsup = 2 sequences. Then, subsequence <(bread, apple),(apple)(bread)> is a sequential pattern since it appears in two sequences of the above database, which is no less than minsup. A sequential pattern mining algorithm should be able to find all subsequences that meet this criterion, that is to find all sequential patterns. Above I have given a quite informal definition with an example. If you want more details about this problem, you could read my survey paper about sequential pattern mining, which gives a more detailed overview of this problem.
The CloSpan algorithm is an algorithm to find a special type of sequential patterns called the closed sequential patterns (see the above survey for more details).
What is the problem?
In algorithmic two important aspects are often discussed when talking about some algorithms. They are whether an algorithm is correct or incorrect, and whether it is complete or incomplete. For the problem of sequential pattern mining, a complete algorithm is an algorithm that can find all the required sequential patterns, and a correct algorithm is one that will correctly calculate the number of occurrences of each pattern found (correctly calculate the support).
Now the problem with CloSpan is that it an incomplete algorithm. It applies some theoretical results (Theorem 1, Lemma 3 and Corollary 1-2 in [1]) to reduce the search space, which work well for some databases containing only sequences of itemsets containing a single item each. But in some cases where a database contains sequences of itemsets containing more than one item per itemset, the algorithm can miss some sequential patterns. In other words, the CloSpan algorithm is incomplete. In particular, in the following paper [2], it was shown that the Theorem 1, Lemma 3 and Corollary 1-2 of the original CloSpan paper are incorrect.
[2] Le, B., Duong, H., Truong, T., & Fournier-Viger, P. (2017). FCloSM, FGenSM: two efficient algorithms for mining frequent closed and generator sequences using the local pruning strategy. Knowledge and Information Systems, 53(1), 71-107.
In [2], this was shown with an example (Example 2) and it was also shown experimentally that if the original Corollary 1 of CloSpan is used, patterns can be missed. For example, on a dataset called D0.1C8.9T4.1N0.007S6I4, CloSpan can find 145,319 patterns, while other algorithms for closed sequential pattern mining can find 145,325 patterns. Thus, patterns can be missed, although not necessarily many.
I will not discussed the details about why the results are incomplete as it would requires too much explanations for a blog post. But for those interested, you can check the above paper [2] which has clearly explained the problem.
And what about other algorithms?
In this blog post, I have discussed a theoretical problem in the CloSpan algorithm such that CloSpan is incomplete in some cases. This result was presented in the paper [2]. But to ensure more visibility for this result, as CloSpan is a popular algorithm, I thought that it would be a good idea to write a blog post about this result.
By the way, this is not the only sequential pattern mining algorithm that is incorrect or incomplete. Another famous sequential pattern mining that is incorrect is IncSpan for incremental sequential pattern mining, published in the KDD 2004 conference, and cited in more than 240 papers:
[3] Cheng, H., Yan, X., & Han, J. (2004, August). IncSpan: incremental mining of sequential patterns in large database. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 527-532). ACM.
After the paper about IncSpan was published, the following year, it was shown in a PAKDD paper[4] that IncSpan is incomplete (it can miss some patterns). In that same paper, a complete algorithm called IncSpan+ was proposed. This is the paper [4] that has demonstrated that IncSpan is incomplete:
Nguyen, S. N., Sun, X., & Orlowska, M. E. (2005, May). Improvements of IncSpan: Incremental mining of sequential patterns in large database. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 442-451). Springer, Berlin, Heidelberg.
Conclusion
The lesson to learn from this is that it is important to check the correctness and completeness of algorithms carefully before publishing a paper. And it also highlights that errors can also be found even in papers published in the very top conferences of the field, as reviewers often do not have time to check all the details of a paper, and papers sometimes do not include a detailed proof that algorithms are correct and complete. But although I have highlighted two cases with CloSpan and IncSpan, there are not the only papers to have theoretical problems. It happens quite often, actually.
==
Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.