Subgraph mining datasets

In this post, I will provide two standard benchmark datasets that can be used for frequent subgraph mining. Moreover, I will provide a set of small graph datasets that I have created for debugging subgraph mining algorithms.

The format of graph datasets

A graph dataset is a text file that contains one or more graphs. Each graph is described by a few lines of text in the following format (the one used by the gSpan algorithm):

  • t # N    This is the first line of a graph. It indicates that this is the N-th graph in the file
  • v M L     This line defines the M-th vertex of the current graph, which has a label L
  • e P Q L   This line defines an edge, which connects the P-th vertex with the Q-th vertex. This edge has the label L
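
To make the format concrete, here is a minimal sketch of a reader for such files, written in Python. The names `Graph` and `parse_graphs` are hypothetical (they do not come from any library), and the sketch assumes that all labels are integers, as in the small datasets of this post.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """One graph read from a dataset (hypothetical helper type)."""
    graph_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: list = field(default_factory=list)     # (vertex id, vertex id, edge label)


def parse_graphs(text):
    """Parse the text of a graph dataset into a list of Graph objects.

    Assumption: labels are integers and every 'v' or 'e' line
    appears after the 't' line of its graph.
    """
    graphs = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # N": parts[1] is the '#' sign, parts[2] is the graph number
            graphs.append(Graph(graph_id=int(parts[2])))
        elif kind in ("v", "e") and not graphs:
            raise ValueError("vertex or edge found before any 't' line")
        elif kind == "v":
            # "v M L": vertex M with label L
            graphs[-1].vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            # "e P Q L": edge between vertices P and Q with label L
            graphs[-1].edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
        else:
            raise ValueError("unexpected line: " + line)
    return graphs


SAMPLE = """t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21
"""

graphs = parse_graphs(SAMPLE)
print(graphs[0].vertices)
print(graphs[0].edges)
```

The sample text is the content of single_graph1.txt. To read a file instead, the text could be obtained with `open(path).read()` and passed to `parse_graphs`.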

Five small datasets

Here are five small datasets that I have created for debugging frequent subgraph mining algorithms. Each dataset contains a single graph, which is enough for some small debugging tasks.

1) single_graph1.txt 

Content of the file:

t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21

Visual representation:

(L10) ---L21--- (L11) ---- L21 ---- (L12)
  0              1                   2

2) single_graph2.txt

Content of the file:

t # 1
v 0 10
v 1 11
v 2 10
v 3 10
e 0 1 21
e 2 1 21
e 1 3 21

Visual representation:

(L10) --- L21 --- (L11) --- L21 ---- (L10)
  0                 1                  2
                    |
                    |
                   L21
                    |
                    |
                  (L10)3

3) single_graph3.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 10
e 0 1 20
e 1 2 20
e 2 0 20

Visual representation:

          (L10) 0
          /    \
       L20      L20
        /        \
   (L10) - L20 - (L10)
     1             2

4) single_graph4.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 1 21
e 0 2 20
e 1 3 20
e 2 3 22
e 1 2 23

Visual representation:

    (L10) ------- L20 ------ (L11)
      |                    /   |
      |                 /      |
      |              /         |
      L21          /           |
      |         L23           L22
      |        /               |
      |      /                 |
      |    /                   |
      |  /                     |
    (L10) ------ L20 -------- (L11)

5) single_graph5.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 2 20
e 1 3 20
e 1 2 20

Visual representation:

(L10) --- L20 --- (L11) --- L20 --- (L10) --- L20 --- (L11)
  0                 2                 1                 3
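
When debugging, it can help to check the one-edge patterns of a small graph by hand before running a full algorithm. The following Python sketch counts them for single_graph5.txt. It assumes that edges are undirected (as in the drawings above), and the function name `one_edge_patterns` is hypothetical. How support should be counted inside a single graph depends on the algorithm, so this only counts edge occurrences.

```python
from collections import Counter

# Content of single_graph5.txt, written out as Python data
vertices = {0: 10, 1: 10, 2: 11, 3: 11}        # vertex id -> vertex label
edges = [(0, 2, 20), (1, 3, 20), (1, 2, 20)]   # (vertex id, vertex id, edge label)


def one_edge_patterns(vertices, edges):
    """Count edges grouped by (vertex label, edge label, vertex label).

    Assumption: edges are undirected, so the two vertex labels
    are sorted to give each pattern a single canonical form.
    """
    counts = Counter()
    for u, v, edge_label in edges:
        low, high = sorted((vertices[u], vertices[v]))
        counts[(low, edge_label, high)] += 1
    return counts


patterns = one_edge_patterns(vertices, edges)
print(patterns)  # Counter({(10, 20, 11): 3})
```

All three edges of this graph connect a vertex labeled 10 to a vertex labeled 11 with an edge labeled 20, so there is only one distinct one-edge pattern.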

Two standard benchmark datasets

Moreover, here are two popular datasets that are used in frequent subgraph mining (I obtained them from GitHub):

Conclusion

In this blog post, I have shared some helpful datasets. If you want to know more about subgraph mining, you may read my short introduction to subgraph mining.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.
