Subgraph mining datasets

In this post, I will provide links to standard benchmark datasets that can be used for frequent subgraph mining. Moreover, I will provide a set of small graph datasets that can be used for debugging subgraph mining algorithms.

The format of graph datasets

A graph dataset is a text file that contains one or more graphs. Each graph is described by a few lines of text in the following format (the one used by the gSpan algorithm):

t # N    This is the first line of a graph. It indicates that this is the N-th graph in the file

v M L     This line defines the M-th vertex of the current graph, which has a label L

e P Q L   This line defines an edge, which connects the P-th vertex with the Q-th vertex. This edge has the label L
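
To make the three line types concrete, here is a minimal Python sketch of a reader for this format. The function name parse_graphs and the dictionary layout are my own choices for illustration, not part of gSpan or SPMF, and the sketch assumes that all identifiers and labels are integers, as in the examples of this post.

```python
def parse_graphs(text):
    """Parse text in the 't # N' / 'v M L' / 'e P Q L' format into a list of graphs."""
    graphs = []
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "t":
            # 't # N' starts the N-th graph
            current = {"id": int(parts[2]), "vertices": {}, "edges": []}
            graphs.append(current)
        elif parts[0] == "v":
            # 'v M L': vertex M has label L
            current["vertices"][int(parts[1])] = int(parts[2])
        elif parts[0] == "e":
            # 'e P Q L': edge between vertices P and Q with label L
            current["edges"].append((int(parts[1]), int(parts[2]), int(parts[3])))
    return graphs


# A three-vertex example graph in the format described above
sample = """t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21
"""

graphs = parse_graphs(sample)
print(graphs[0]["vertices"])
print(graphs[0]["edges"])
```

Keeping vertices in a dictionary keyed by vertex number makes it easy to look up the label of each endpoint of an edge later on.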

Small datasets for debugging

Here are some small datasets that can be used for debugging frequent subgraph mining algorithms. Each dataset contains one or two graphs, which is enough for some small debugging tasks.

1) single_graph1.txt 

Content of the file:

t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21

Visual representation:

(L10) ---L21--- (L11) ---- L21 ---- (L12)
  0              1                   2

2) single_graph2.txt

Content of the file:

t # 1
v 0 10
v 1 11
v 2 10
v 3 10
e 0 1 21
e 2 1 21
e 1 3 21

Visual representation:

(L10) --- L21 --- (L11) --- L21 ---- (L10)
  0                 1                  2
                    |
                    |
                   L21
                    |
                    |
                  (L10)3
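
When debugging, it can help to check a drawing like this one against the file. The following Python sketch builds an adjacency list from the content of single_graph2.txt and prints the neighbors of each vertex. The function name adjacency is hypothetical, and the sketch assumes that edges are undirected and that the text holds a single graph.

```python
from collections import defaultdict

# Content of single_graph2.txt, copied from this post
single_graph2 = """t # 1
v 0 10
v 1 11
v 2 10
v 3 10
e 0 1 21
e 2 1 21
e 1 3 21
"""


def adjacency(text):
    """Return (vertex labels, neighbor lists) for a single graph, treating edges as undirected."""
    labels = {}
    neighbors = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            labels[int(parts[1])] = int(parts[2])
        elif parts[0] == "e":
            p, q, edge_label = (int(x) for x in parts[1:4])
            # store the edge in both directions (assumption: undirected graph)
            neighbors[p].append((q, edge_label))
            neighbors[q].append((p, edge_label))
    return labels, neighbors


labels, neighbors = adjacency(single_graph2)
degrees = {v: len(neighbors[v]) for v in labels}
for v in sorted(labels):
    print("vertex", v, "label", labels[v], "neighbors", sorted(neighbors[v]))
```

Vertex 1 should come out with three neighbors and the other vertices with one each, which matches the star shape of the drawing.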

3) single_graph3.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 10
e 0 1 20
e 1 2 20
e 2 0 20

Visual representation:

(L10)0 ----- L20 ----- (L10)1
   \                     /
    \                   /
     L20             L20
        \           /
         \         /
          \       /
            (L10)2

4) single_graph4.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 1 21
e 0 2 20
e 1 3 20
e 2 3 22
e 1 2 23

Visual representation:

    (L10) ------- L20 ------ (L11)
      |                    /   |
      |                 /      |
      |              /         |
      L21          /           |
      |         L23           L22
      |        /               |
      |      /                 |
      |    /                   |
      |  /                     |
    (L10) ------ L20 -------- (L11)

5) single_graph5.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 2 20
e 1 3 20
e 1 2 20

Visual representation:

(L10) --- L20 --- (L11) --- L20 --- (L10) --- L20 --- (L11)
  0                 2                 1                 3

6) One_graph.txt

Content of the file:

t # 0
v 0 0
v 1 1
v 2 2
v 3 3
v 4 2
v 5 0
v 6 1
e 0 1 0
e 1 2 1
e 0 2 2
e 2 3 3
e 3 4 4
e 4 5 2
e 4 6 1
e 5 6 0
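
As a quick textual check of this graph, the Python sketch below counts how often each single-edge pattern (vertex label, edge label, vertex label) occurs. The function name count_edge_patterns is hypothetical. The sketch assumes undirected edges, a single graph in the text, and vertex lines that appear before the edge lines that use them. Note that this is a raw occurrence count, not the support measure of any particular mining algorithm.

```python
from collections import Counter

# Content of One_graph.txt, copied from this post
one_graph = """t # 0
v 0 0
v 1 1
v 2 2
v 3 3
v 4 2
v 5 0
v 6 1
e 0 1 0
e 1 2 1
e 0 2 2
e 2 3 3
e 3 4 4
e 4 5 2
e 4 6 1
e 5 6 0
"""


def count_edge_patterns(text):
    """Count edges by (smaller vertex label, edge label, larger vertex label)."""
    labels = {}
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            labels[int(parts[1])] = int(parts[2])
        elif parts[0] == "e":
            p, q, edge_label = (int(x) for x in parts[1:4])
            # sort the endpoint labels so that both directions give the same key
            low, high = sorted((labels[p], labels[q]))
            counts[(low, edge_label, high)] += 1
    return counts


patterns = count_edge_patterns(one_graph)
for pattern, n in sorted(patterns.items()):
    print(pattern, n)
```

A one-edge pattern that a mining algorithm reports for this graph should correspond to one of the keys printed here, which gives a simple baseline to compare against.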

Large datasets for subgraph mining

In addition, about 15 large graph datasets that are commonly used for frequent subgraph mining are available on this webpage:

SPMF Public Datasets (webpage)

Want to try frequent subgraph mining?

If you want to try frequent subgraph mining algorithms, fast open-source Java implementations of gSpan, cgSpan, and TKG (for top-k frequent subgraph mining) are available in the SPMF data mining library.

Conclusion

In this blog post, I have shared some helpful datasets. If you want to know more about subgraph mining, you may read my short introduction to subgraph mining.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.
