Today, I will show how to use the open-source SPMF data mining software to discover sequential patterns in web log data. Then, I will show to how visualize the frequent sequential patterns found using GraphViz.
Step 1 : getting the data.
The first step is to get some data. To discover sequential patterns, we will use the SPMF software. Therefore the data has to be in SPMF format. In this blog post, I will just use the web log data from the FIFA world cup from the datasets webpage of the SPMF website. It is already in SPMF format.
Step 2 : extracting sequential patterns.
Then using the SPMF.jar file downloaded from the SPMF website, I have applied the PrefixSpan algorithm to discover frequent sequences of webpages visited by the users. I have set the minsup parameter of PrefixSpan to 0.15, which means that a sequence of webpages is frequent if it is visited by at least 15% of the users.
The result is an output file containing 5123 sequential patterns in a text file. For example, here are three patterns from the output file:
155 -1 44 -1 21 -1 #SUP: 4528
147 -1 #SUP: 7787
147 -1 59 -1 #SUP: 6070
147 -1 57 -1 #SUP: 6101
The first one represents users vising the webpage 155, then visiting webpage 44 and then 21. The line indicates that this patterns has a support of 4528, which means that this patterns is shared by 4528 users.
The result may be hard to understand in a text file, so we will next visualize them using GraphViz.
Step 3: Transforming the output file into GraphViz DOT format.
I use a very simple piece of Java code to transform the sequential patterns found by SPMF to the GraphViz DOT format.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
/**
* @author Philippe Fournier-Viger, 2014
*/
public class MainDatasetConvertSequentialPatternsToDotFormat {
public static void main(String [] arg) throws IOException, InterruptedException{
String input = "C:\\patterns\\test.txt";
String output = "C:\\patterns\\input.dot";
BufferedReader myInput = new BufferedReader(new InputStreamReader( new FileInputStream(new File(input))));
// map to remember arrow already seen
Map<String, String> mapEdges = new HashMap<String, String>();
// for each line (pattern) until the end of file
String thisLine;
while ((thisLine = myInput.readLine()) != null) {
if (thisLine.isEmpty() == true) {
continue;
}
// split the pattern according to the " " separator
String split[] = thisLine.split(" ");
boolean firstItemOfItemset = true;
boolean firstItemset = true;
String previousItemFromSameItemset = null;
// for each token
for(String token : split) {
if("-1".equals(token)) { // if end of an item
}else if("-2".equals(token) || '#' == token.charAt(0)){ // if end of sequence
previousItemFromSameItemset = null;
break;
}else { // if an item
if(previousItemFromSameItemset != null) {
mapEdges.put(previousItemFromSameItemset, token);
}
previousItemFromSameItemset = token;
}
}
}
BufferedWriter writer = new BufferedWriter(new FileWriter(output));
writer.write("digraph mygraph{");
// PRINT THE ARROWS
for(Entry<String, String> edge : mapEdges.entrySet()) {
writer.write(edge.getKey() + " -> " +edge.getValue() + " \n");
}
// Note: only sequential patterns of size >=2 are used to create the graph and
// patterns are assumed to have only one item per itemset.
writer.write("}");
myInput.close();
writer.close();
}
}
Step 4: Generating a graph using GraphViz
Then I installed GraphViz on my computer running Windows 7. GraphViz is a great software for the visualization of graphs and it is not very hard to use. The idea is that you feed GraphViz with a text file describing a graph and then he will automatically draw it. Of course there are many options that can be selected and here I just use the basic options.
I use the command: dot -Tpng patterns.dot > output.png"
This convert my DOT file to a graph in PNG format.
The result is the following (click on the picture to see it in full size).
The result is pretty interesting. It summarizes the 5123 patterns into an easy to understand graph. By looking at the graph we can easily see that many sequential patterns passes by the nodes 44 and 117, which must be very important webpages on the website.
Note that using this technique, we loose a little bit information. For example, the support of edges is not indicated on this graph. It may be possible to improve upon this visualization for example by using various colors to indicate the support or the number of patterns that pass by a node.
Hope that you have enjoyed this post. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.