In this post, I will provide **two standard benchmark datasets** that can be used for **frequent subgraph mining.** Moreover, I will provide **a set of small graph datasets** that I have created for debugging **subgraph mining algorithms**.

**The format of graph datasets**

A **graph dataset** is a text file that contains one or more graphs. Each graph is described by a few lines of text in the following format (used by the gSpan algorithm):

**t # N** This is the first line of a graph. It indicates that this is the N-th graph in the file.

**v M L** This line defines the M-th vertex of the current graph, which has the label L.

**e P Q L** This line defines an edge, which connects the P-th vertex with the Q-th vertex. This edge has the label L.
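To make the format concrete, here is a minimal Python sketch of a reader for the three line types above. The names `Graph` and `parse_graphs` are illustrative only (they are not part of gSpan or SPMF), and the sketch assumes that labels are integers and that every vertex and edge line follows a `t` line.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    graph_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: list = field(default_factory=list)     # (vertex id, vertex id, edge label)


def parse_graphs(text):
    """Parse every graph found in a string written in the t/v/e format."""
    graphs = []
    current = None
    for line_number, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # N": start of the N-th graph
            current = Graph(graph_id=int(parts[2]))
            graphs.append(current)
        elif current is None:
            raise ValueError(f"line {line_number}: '{kind}' line before any 't' line")
        elif kind == "v":
            # "v M L": vertex M with label L
            current.vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            # "e P Q L": edge between vertices P and Q with label L
            current.edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
        else:
            raise ValueError(f"line {line_number}: unknown line type '{kind}'")
    return graphs


example = "t # 1\nv 0 10\nv 1 11\nv 2 12\ne 0 1 21\ne 2 1 21\n"
for g in parse_graphs(example):
    print(g.graph_id, len(g.vertices), "vertices,", len(g.edges), "edges")
```

Keeping vertices in a dictionary keyed by vertex id makes it easy to look up the labels of an edge's endpoints later on.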

**Five small datasets**

Here are five small datasets that I have created for debugging frequent subgraph mining algorithms. Each dataset contains a single graph, which is enough for some small debugging tasks.

Content of the file:

    t # 1
    v 0 10
    v 1 11
    v 2 12
    e 0 1 21
    e 2 1 21

Visual representation:

    (L10)--- L21 ---(L11)--- L21 ---(L12)
      0               1               2

Content of the file:

    t # 1
    v 0 10
    v 1 11
    v 2 10
    v 3 10
    e 0 1 21
    e 2 1 21
    e 1 3 21

Visual representation:

    (L10)--- L21 ---(L11)--- L21 ---(L10)
      0               1               2
                      |
                      |
                     L21
                      |
                      |
                    (L10)
                      3

Content of the file:

    t # 1
    v 0 10
    v 1 10
    v 2 10
    e 0 1 20
    e 1 2 20
    e 2 0 20

Visual representation:

    (L10)--- L20 ---(L10)--- L20 ---(L10)
      0               1               2
      |                               |
      +------------- L20 -------------+

Content of the file:

    t # 1
    v 0 10
    v 1 10
    v 2 11
    v 3 11
    e 0 1 21
    e 0 2 20
    e 1 3 20
    e 2 3 22
    e 1 2 23

Visual representation:

    (L10)------- L20 -------(L11)
      |                   /   |
      |                 /     |
      |               /       |
     L21            /         |
      |          L23         L22
      |         /             |
      |       /               |
      |     /                 |
      |   /                   |
    (L10)------- L20 -------(L11)

Content of the file:

    t # 1
    v 0 10
    v 1 10
    v 2 11
    v 3 11
    e 0 2 20
    e 1 3 20
    e 1 2 20

Visual representation:

    (10)-- 20 --(11)-- 20 --(10)-- 20 --(11)
     0           2           1           3
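If you need other small graphs for debugging, writing them in the same format is straightforward. The following Python sketch shows one way to do it. The helper `format_graph` is a hypothetical name introduced here for illustration, and it assumes that vertices are numbered from 0 in the order they are listed.

```python
def format_graph(graph_id, vertex_labels, edges):
    """Return one graph as text in the t/v/e format.

    vertex_labels: list where position i holds the label of vertex i
    edges: list of (vertex id, vertex id, edge label) tuples
    """
    lines = [f"t # {graph_id}"]
    for vertex_id, label in enumerate(vertex_labels):
        lines.append(f"v {vertex_id} {label}")
    vertex_count = len(vertex_labels)
    for source, target, label in edges:
        # Catch typos early: both endpoints must be declared vertices.
        if not (0 <= source < vertex_count and 0 <= target < vertex_count):
            raise ValueError(f"edge ({source}, {target}) refers to an undefined vertex")
        lines.append(f"e {source} {target} {label}")
    return "\n".join(lines) + "\n"


# A path of three vertices labeled 10, 11 and 12, joined by edges labeled 21.
text = format_graph(1, [10, 11, 12], [(0, 1, 21), (2, 1, 21)])
print(text)
```

Checking the edge endpoints while writing the file avoids chasing a bug in a mining algorithm that is really a mistake in the test data.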

**Two standard benchmark datasets**

Moreover, here are two popular datasets that are used in frequent subgraph mining (I obtained them from GitHub):

Coumpound_422.txt : It contains 422 graphs

Chemical_340.txt : It contains 340 graphs
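A quick sanity check after downloading such a file is to count its graphs and to see in how many graphs each labeled edge occurs, since these numbers can be compared with the single-edge patterns reported by a mining algorithm. The Python sketch below is only an illustration: `edge_support` is a hypothetical helper, and it assumes that edges are undirected and that each vertex is declared before the edges that use it. For a benchmark file, you would pass the text read from wherever you saved it instead of the inline sample.

```python
from collections import Counter


def edge_support(text):
    """Return (number of graphs, Counter mapping each labeled edge to its support).

    A labeled edge is a tuple (vertex label, edge label, vertex label).
    Its support is the number of graphs that contain it at least once.
    """
    support = Counter()
    labels = {}   # vertex id -> label, for the current graph
    seen = set()  # labeled edges already counted for the current graph
    graph_count = 0
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "t":
            graph_count += 1
            labels, seen = {}, set()
        elif parts[0] == "v":
            labels[parts[1]] = parts[2]
        elif parts[0] == "e":
            # Sort the endpoint labels so that both directions give the same key.
            first, second = sorted((labels[parts[1]], labels[parts[2]]))
            key = (first, parts[3], second)
            if key not in seen:
                seen.add(key)
                support[key] += 1
    return graph_count, support


sample = (
    "t # 0\nv 0 10\nv 1 11\nv 2 12\ne 0 1 21\ne 2 1 21\n"
    "t # 1\nv 0 10\nv 1 11\ne 0 1 21\n"
)
count, support = edge_support(sample)
print(count, "graphs")
for edge, value in support.most_common():
    print(edge, "appears in", value, "graph(s)")
```

Labels are kept as strings here, which is enough for counting and avoids assuming that every dataset uses integer labels.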

**Want to try frequent subgraph mining?**

If you want to try frequent subgraph mining algorithms, fast open-source Java implementations of **gSpan** and of **TKG** (for **top-k frequent subgraph mining**) are available in the **SPMF data mining library**.

**Conclusion**

In this blog post, I have shared some helpful datasets. If you want to know more about subgraph mining, you may read my short introduction to subgraph mining.

—**Philippe Fournier-Viger** is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.