In this blog post, I will briefly discuss the **fact **that the popular **CloSpan algorithm** for **frequent sequential pattern mining **is an **incomplete algorithm**. This means that in some special situations, **CloSpan** does not produce the expected results that it has been designed for, and in particular some patterns are missing from the final result. I will try to give a brief explanation of that problem but I will refer to another paper for the actual proof that the algorithm is **incomplete**. Moreover, I will briefly talk about a similar problem in the **IncSpan** algorithm.

**What is CloSpan?**

**CloSpan** is one of the most famous algorithm for** sequential pattern mining**. It is designed for discovering subsequences that appear frequently in a set of sequence. **CloSpan** was published in 2003 in the famous SIAM Data Mining conference:

[1] Yan, X., Han, J., & Afshar, R. (2003, May).

CloSpan: Mining: Closed sequential patterns in large datasets.InProceedings of the 2003 SIAM international conference on data mining(pp. 166-177). Society for Industrial and Applied Mathematics.

The SIAM DM conference is a top conference. And the CloSpan paper has been **cited in more than 940 papers**, and used by many researchers.

**What is CloSpan used for? **

The **CloSpan** algorithm takes as input a parameter called * minsup * and a

**sequence database**(a set of sequences). A

**sequence**is an ordered list of itemsets. An itemset is a set of symbols. For example, consider the following sequence database, which contains three sequences:

Sequence 1: <(bread, apple),(milk, apple), (bread)>

Sequence 2: <(orange, apple),(milk, apple, bread), (bread, citrus)>

Sequence 3: <(bread, apple,citrus),(orange, apple), (bread)>

Here the first sequence (Sequence 1) means that a customer has bought the items bread and apple together, has then bought milk and apple together, and then has bought bread. Thus, this sequence contains three itemsets. The first itemset is bread with apple. The second itemset is milk with apple. The third itemset is bread.

In **sequential pattern mining** the goal is to find all subsequences that appears in at least *minsup *sequences of a database (all the **sequential patterns**). For example, consider the above database and that *minsup = 2 sequences*. Then, subsequence <(bread, apple),(apple)(bread)> is a sequential pattern since it appears in two sequences of the above database, which is no less than *minsup*. A sequential pattern mining algorithm should be able to find all subsequences that meet this criterion, that is to find all **sequential patterns**. Above I have given a quite informal definition with an example. If you want more details about this problem, you could read my survey paper about sequential pattern mining, which gives a more detailed overview of this problem.

The **CloSpan** algorithm is an algorithm to find a special type of sequential patterns called the **closed sequential patterns **(see the above survey for more details).

**What is the problem? **

In algorithmic two important aspects are often discussed when talking about some algorithms. They are whether an algorithm is **correct** or **incorrect**, and whether it is **complete** or **incomplete**. For the problem of sequential pattern mining, a complete algorithm is an algorithm that can find all the required sequential patterns, and a correct algorithm is one that will correctly calculate the number of occurrences of each pattern found (correctly calculate the support).

Now the problem with **CloSpan** is that it an incomplete algorithm. It applies some theoretical results (**Theorem 1, Lemma 3 and Corollary 1-2 **in** [1]) **to reduce the search space, which work well for some databases containing only sequences of itemsets containing a single item each. But in some cases where a database contains sequences of itemsets containing more than one item per itemset, the algorithm can miss some sequential patterns. In other words, the **CloSpan** algorithm is incomplete. In particular, in the following paper [2], it was shown that the **Theorem 1, Lemma 3 and Corollary 1-2 of the original CloSpan paper are incorrect**.

[2] Le, B., Duong, H., Truong, T., & Fournier-Viger, P. (2017). FCloSM, FGenSM: two efficient algorithms for mining frequent closed and generator sequences using the local pruning strategy. *Knowledge and Information Systems*, *53*(1), 71-107.

In [2], this was shown with an example (Example 2) and it was also shown experimentally that if the original Corollary 1 of CloSpan is used, patterns can be missed. For example, on a dataset called D0.1C8.9T4.1N0.007S6I4, CloSpan can find 145,319 patterns, while other algorithms for closed sequential pattern mining can find 145,325 patterns. Thus, patterns can be missed, although not necessarily many.

I will not discussed the details about why the results are incomplete as it would requires too much explanations for a blog post. But for those interested, you can check the above paper [2] which has clearly explained the problem.

**And what about other algorithms?**

In this blog post, I have discussed a theoretical problem in the **CloSpan** algorithm such that **CloSpan** is **incomplete** in some cases. This result was presented in the paper [2]. But to ensure more visibility for this result, as CloSpan is a popular algorithm, I thought that it would be a good idea to write a blog post about this result.

By the way, this is not the only sequential pattern mining algorithm that is incorrect or incomplete. **Another famous sequential pattern mining that is incorrect** is **IncSpan **for **incremental sequential pattern mining**, published in the KDD 2004 conference, and cited in more than 240 papers:

[3] Cheng, H., Yan, X., & Han, J. (2004, August). IncSpan: incremental mining of sequential patterns in large database. In *Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining* (pp. 527-532). ACM.

After the paper about **IncSpan** was published, the following year, it was shown in a PAKDD paper[4] that **IncSpan is incomplete** (it can miss some patterns). In that same paper, a complete algorithm called **IncSpan+** was proposed. This is the paper [4] that has demonstrated that **IncSpan** is incomplete:

Nguyen, S. N., Sun, X., & Orlowska, M. E. (2005, May). Improvements of IncSpan: Incremental mining of sequential patterns in large database. In *Pacific-Asia Conference on Knowledge Discovery and Data Mining* (pp. 442-451). Springer, Berlin, Heidelberg.

**Conclusion**

The lesson to learn from this is that it is important to check the correctness and completeness of algorithms carefully before publishing a paper. And it also highlights that errors can also be found even in papers published in the very top conferences of the field, as reviewers often do not have time to check all the details of a paper, and papers sometimes do not include a detailed proof that algorithms are correct and complete. But although I have highlighted two cases with **CloSpan** and **IncSpan**, there are not the only papers to have theoretical problems. It happens quite often, actually.

==

**Philippe Fournier-Viger** is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account **@philfv **to get notified about new posts.