In this blog post, I will give an introduction to the task of** sequence prediction**, a popular data mining/machine learning task, which consist of **predicting the next symbol of a sequence of symbols**. This task is important as it have many real-life applications such as webpage prefetching and product recommendation.

**What is a sequence?**

Before defining the problem of **sequence prediction**, it is necessary to first explain what is a sequence. A **sequence** is an ordered list of symbols. For example, here are some common types of sequences:

- A sequence of webpages visited by a user, ordered by the time of access.
- A sequence of words or characters typed on a cellphone by a user, or in a text such as a book.
- A sequence of products bought by a customer in a retail store
- A sequence of proteins in bioinformatics
- A sequence of symptoms observed on a patient at a hospital

Note that in the above definition, we consider that a sequence is a list of symbols and do not contain numeric values. A sequence of numeric values is usually called a time-series rather than a sequence, and the task of predicting a time-series is called time-series forecasting. But this is another topic.

**What is sequence prediction?**

The task of** sequence prediction** consists of predicting the next symbol of a sequence based on the previously observed symbols. For example, if a user has visited some webpages A, B, C, in that order, one may want to predict what is the next webpage that will be visited by that user to prefetch the webpage.

*An illustration of the problem of sequence prediction*

There are **two steps to perform sequence prediction**:

- First, one must
**train a sequence prediction model**using some previously seen sequences called the training sequences. This process is illustrated below:For example, one could train a sequence prediction model for webpage prediction using the sequences of webpages visited by several users.

- The
**second step**is to use a trained sequence prediction model to**perform prediction**for new sequences (i.e. predict the next symbol of a new sequence), as illustrated below.For example, using a prediction model trained with the sequences of webpages visited by several users, one may predict the next webpage visited by a new user.

**An overview of state-of-the-art sequence prediction models**

Having defined what are the main** sequence prediction models**, that could be used in an application?

There are actually **numerous models** that have been proposed by researchers such as DG, All-k-order Markov, TDAG, PPM, CPT and CPT+. These models utilize various approaches. Some of them uses for example neural networks, pattern mining and a probabilistic approach.

**How to determine if a sequence prediction model is good?**

Various **sequence prediction models** have different advantages and limitations, and may perform more or less well on different types of data. Typically a sequence prediction model is evaluated in terms of criteria such as **prediction accuracy**, the **memory** that it uses and the **execution time **for training and performing predictions.

Several benchmark have been done in the literature to compare various models. For example, here is a recent benchmark performed by my team in our PAKDD 2015 paper about sequence prediction with CPT+.

In this benchmark, we compare our proposed **CPT+ sequence prediction model** with several state-of-the-art models on various types of data. Briefly, BMS, MSNBC, Kosarak and Fifa are sequences of webpages. SIGN is a sign-language dataset. Bible word and Bible char are datasets of sequences of Words and characters. As it can be seen, for this type of data at least, CPT+ greatly outperform other models. There are several reasons. One of them is that several models such as DG assume the Markovian hypothesis that the next symbol only depends on the previous symbols. Another reason is that the CPT+ model use an efficient indexing approach to consider all the relevant data for each prediction (see details in the paper).

**Where can I get open-source implementations of sequence prediction models?**

Some **open-source and Java implementation** of the seven discussed **sequence prediction models** (DG, AKOM, TDAG, LZ78, PPM, CPT, CPT+) can be found in the SPMF open-source data mining library, which includes the implementations from the IPredict project.

There are extensive documentation about how to use these models on the SPMF website. Here I will provide a quick example that shows how the CPT model can be easily applied with just a few lines of code.

// Phase 1: Training // Load a file containing the training sequences into memory SequenceDatabase trainingSet = new SequenceDatabase(); trainingSet.loadFileSPMFFormat("training_sequences.txt", Integer.MAX_VALUE, 0, Integer.MAX_VALUE); // Train the prediction model String optionalParameters = "splitLength:6 recursiveDividerMin:1 recursiveDividerMax:5"; CPTPredictor predictionModel = new CPTPredictor("CPT", optionalParameters); predictionModel.Train(trainingSet.getSequences()); // Phase 2: Sequence prediction // We will predict the next symbol after the sequence <1,4> Sequence sequence = new Sequence(0); sequence.addItem(new Item(1)); sequence.addItem(new Item(4)); Sequence thePrediction = predictionModel.Predict(sequence); System.out.println("For the sequence <(1),(4)>, the prediction for the next symbol is: +" + thePrediction);

Thus, without going into the details of each prediction models, it can be seen that it is very easy to train a sequence prediction model and use it to perform predictions.

**Conclusion**

This blog post has introduced the task of **sequence prediction**, which has many applications. Furthermore, implementations of open-source sequence prediction models have been presented.

==

**Philippe Fournier-Viger** is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account **@philfv **to get notified about new posts.

Great! Thanks a lot

Great article!

Quick question..

Is sequential pattern mining a prerequisite for sequence prediction? A first step for sequence prediction could be to identify the various possible sequences, so can we use the sequence pattern mining for that?

Hello,

Thanks. Yes you could use sequential patterns or sequential rules. Some people have done that. For example, in my ADMA paper from 2012, I have compared different types of sequential rules for sequence prediction:

Fournier-Viger, P. Gueniche, T., Tseng, V.S. (2012). Using Partially-Ordered Sequential Rules to Generate More Accurate Sequence Prediction. Proc. 8th Intern. Conference on Advanced Data Mining and Applications (ADMA 2012), Springer LNAI 7713, pp. 431-442.

However, a problem with this idea of using patterns for predictions, is that you need to set the minimum support. If you set this parameter to 20% for example, then you will miss all the rare cases that have a support below 20 %, and thus the prediction for these rare cases may not be good. But rare cases are also important if you want your model to have a high prediction accuracy. In other words, using patterns for predictions can be done, but it will work best for frequent cases and may not work well for rare cases because no pattern will be available. In some case, if you don’t have patterns, you cannot even make a prediction.

For this reason, in my work after that, my student T. Gueniche have stopped using patterns and have designed CPT and CPT+. The idea of these models is that we do not need patterns because if we use patterns we loose information. Instead, we keep all the sequences and use some special structure to quickly find all the relevant sequences for making a given predictions. Thus, this approach can work both for frequent and rare cases. CPT and CPT+ were published in these papers:

Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI9078, pp. 625-636.

Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

Using CPT and CPT+ we obtain better prediction results than using sequential patterns/rules.

So in my opinion, it is better to use an approach without patterns to get the best predictions.

If I have a set of random numbers, can I predict the next number using sequence prediction??

Hello, Yes you could make predictions from a sequence of random numbers. But the result would be meaningless because the numbers are random.

Pingback: Making Predictions with Sequences | A bunch of data

Pingback: Report about the DEXA 2018 and DAWAK 2018 conferences - The Data Mining BlogThe Data Mining Blog

Hey that’s a great blog!

I’m Looking for a solution to this problem.

I have a file with 200 data entries. Each data entry is a sequence of numbers. Example:

data = [2, 4, 1, 20, 50, 13], [1, 20, 130, 140, 2, 24], [2, 30, 40, 1, 9,4].

Each data entry consists of 6 numbers. A number can be any integer number between 1 and 150.

I want to predict the 100 most possible new data entries (sequences). Based on the training model, it should make predictions of possible new entries.

How do you solve this?

Can this be done with CPT?

Pingback: The PAKDD 2015 Conference (a brief report) - The Data Mining Blog

Great article!! Thanks for sharing the knowledge.

I hope it is not too late to ask a question.

Is it possible to improve the sequence prediction using some other data apart from the sequence? For example predicting the next symptoms on a patient at a hospital not only using the sequences from other patients but the patient age, sex, etc.

If so, could you point me to any example or the general direction I should investigate? Thanks in advance.

Hi, Thanks for reading. Glad it is useful. Yes sure, it would be possible to consider other information to do the sequence prediction. I think that this could help to perform better predictions. The above software in SPMF, do not handle additional information (just sequences of symbols), but if you have time to extend the models, it is certainly possible to do. It would require to think about what is the best way to extend the model with this additional information, and then do the programming. I think it is possible as a project.

Best regards,