Subgraph mining datasets

In this post, I will provide two standard benchmark datasets that can be used for frequent subgraph mining. Moreover, I will provide a set of small graph datasets that I have created for debugging subgraph mining algorithms.


The format of graph datasets

A graph dataset is a text file that contains one or more graphs. Each graph is described by a few lines of text in the following format (the one used by the gSpan algorithm):

  • t # N    This is the first line of a graph. It indicates that this is the N-th graph in the file
  • v M L     This line defines the M-th vertex of the current graph, which has a label L
  • e P Q L   This line defines an edge, which connects the P-th vertex with the Q-th vertex. This edge has the label L
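To make the format concrete, here is a minimal Python sketch of a reader for it. The names `Graph` and `parse_graphs` are hypothetical (they do not come from any library), and the sketch assumes that all identifiers and labels are integers, as in the small datasets of this post.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    graph_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: list = field(default_factory=list)     # (vertex id, vertex id, edge label)


def parse_graphs(text):
    """Parse text in the 't / v / e' format into a list of Graph objects."""
    graphs = []
    for raw_line in text.splitlines():
        parts = raw_line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # N" starts a new graph with identifier N
            graphs.append(Graph(graph_id=int(parts[2])))
        elif kind == "v":
            # "v M L" declares vertex M with label L
            graphs[-1].vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            # "e P Q L" declares an edge between vertices P and Q with label L
            graphs[-1].edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
        else:
            raise ValueError(f"unrecognized line: {raw_line!r}")
    return graphs


# A three-vertex path written in the format described above
example = """t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21
"""

graphs = parse_graphs(example)
print(graphs[0].vertices)
print(graphs[0].edges)
```

Edges are stored exactly as written in the file, so an undirected edge appears only once; a mining algorithm would typically build an adjacency list from this on top.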

Five small datasets

Here are five small datasets that I have created for debugging frequent subgraph mining algorithms. Each dataset contains a single graph, which is enough for some small debugging tasks.

1) single_graph1.txt 

Content of the file:

t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21

Visual representation:

(L10) ---L21--- (L11) ---- L21 ---- (L12)
  0              1                   2

2) single_graph2.txt

Content of the file:

t # 1
v 0 10
v 1 11
v 2 10
v 3 10
e 0 1 21
e 2 1 21
e 1 3 21

Visual representation:

(L10) --- L21 --- (L11) --- L21 ---- (L10)
  0                 1                  2
                    |
                    |
                   L21
                    |
                    |
                  (L10)3

3) single_graph3.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 10
e 0 1 20
e 1 2 20
e 2 0 20

Visual representation:

          (L10)0
          /    \
       L20      L20
        /        \
  (L10)1 - L20 - (L10)2

4) single_graph4.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 1 21
e 0 2 20
e 1 3 20
e 2 3 22
e 1 2 23

Visual representation:

    (L10) ------- L20 ------ (L11)
      |                    /   |
      |                 /      |
      |              /         |
      L21          /           |
      |         L23           L22
      |        /               |
      |      /                 |
      |    /                   |
      |  /                     |
    (L10) ------ L20 -------- (L11)

5) single_graph5.txt

Content of the file:

t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 2 20
e 1 3 20
e 1 2 20

Visual representation:

(L10) --- L20 --- (L11) --- L20 --- (L10) --- L20 --- (L11)
  0                 2                 1                 3
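When debugging with such small graphs, it helps to know in advance what the smallest patterns should be. The following Python sketch counts the single-edge patterns of single_graph5.txt by hand. The function name `edge_patterns` is hypothetical, and the count is the number of occurrences inside one graph, which is not the same as the support of a pattern across a database of several graphs.

```python
from collections import Counter

# Vertex labels and edges of single_graph5.txt, copied from the listing above
vertex_labels = {0: 10, 1: 10, 2: 11, 3: 11}
edges = [(0, 2, 20), (1, 3, 20), (1, 2, 20)]  # (vertex id, vertex id, edge label)


def edge_patterns(vertex_labels, edges):
    """Count single-edge patterns as (vertex label, edge label, vertex label)."""
    counts = Counter()
    for u, v, edge_label in edges:
        # Edges are undirected, so sort the two endpoint labels to get one
        # canonical form per pattern
        low, high = sorted((vertex_labels[u], vertex_labels[v]))
        counts[(low, edge_label, high)] += 1
    return counts


patterns = edge_patterns(vertex_labels, edges)
print(patterns)  # Counter({(10, 20, 11): 3})
```

All three edges of this graph connect a vertex labeled 10 to a vertex labeled 11 through an edge labeled 20, so a correct algorithm should report exactly one distinct single-edge pattern here.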

Two standard benchmark datasets

Moreover, here are two popular datasets that are used in frequent subgraph mining (I obtained them from GitHub):

Conclusion

In this blog post, I have shared some helpful datasets. If you want to know more about subgraph mining, you may read my short introduction to subgraph mining.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.


3 Responses to Subgraph mining datasets

  1. Thinh Tran says:

    Hi Mr Philippe,

    Thank you for your effort in writing this post. I am a Master's student working on subgraph mining. Do you know where I can find a graph dataset (with multiple graphs in the dataset) that has weights on the edges? I have searched the internet a lot, but I only find weighted datasets that consist of a single graph. I would appreciate it if you could answer my question. Thank you for your time and response.

    Best,
    Thinh

    • Hello, I do not have such a dataset and have not searched for one, so I cannot tell you where to find it without searching for it, just like you would do. It is quite possible that there are no publicly available datasets. In that case, you have a few possibilities: (1) contact the authors of papers who used such datasets and ask them for their data, (2) make your own dataset by converting some public data to the format you want or by collecting your own data, or (3) use a synthetic dataset. For the last option, you could, for example, take some ordinary graph datasets and generate the weights randomly (although this is not as good as having a real dataset).

      By the way, if you find some datasets, you can share, and I can add them to this page.
      Best,
      Philippe

  2. Dang Nguyen says:

    Hi Thinh and Philippe,

    I have uploaded nine new large graph datasets for frequent subgraph mining to the link:
    https://github.com/nphdang/gSpan/tree/master/Data

    These datasets have edge labels which can be used as unnormalized edge weights.
    For more information, please refer to this post in the data mining forum:
    http://forum.ai-directory.com/read.php?5,5250

    Cheers,
    Dang Nguyen
