In this post, I will provide two standard benchmark datasets that can be used for frequent subgraph mining. Moreover, I will provide a set of small graph datasets that I have created for debugging subgraph mining algorithms.
The format of graph datasets
A graph dataset is a text file that contains one or more graphs. Each graph is described by a few lines of text in the following format (used by the gSpan algorithm):
- t # N : This is the first line of a graph. It indicates that this is the N-th graph in the file.
- v M L : This line defines the M-th vertex of the current graph, which has the label L.
- e P Q L : This line defines an edge, which connects the P-th vertex with the Q-th vertex. This edge has the label L.
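To make the format concrete, here is a minimal Python sketch of a reader for it. The names `Graph` and `parse_graphs` are hypothetical (they do not come from gSpan or any library), and the sketch assumes that vertex ids and labels are integers, as in the small datasets on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """One graph of the dataset (hypothetical helper type)."""
    graph_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: list = field(default_factory=list)     # (vertex P, vertex Q, edge label)


def parse_graphs(text):
    """Parse text in the 't # N' / 'v M L' / 'e P Q L' format into Graph objects."""
    graphs = []
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # 't # N' starts a new graph with identifier N
            current = Graph(graph_id=int(parts[2]))
            graphs.append(current)
        elif current is None:
            raise ValueError("vertex or edge line found before any 't # N' line")
        elif kind == "v":
            # 'v M L' declares vertex M with label L
            current.vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            # 'e P Q L' declares an edge between vertices P and Q with label L
            current.edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return graphs


if __name__ == "__main__":
    sample = "t # 1\nv 0 10\nv 1 11\nv 2 12\ne 0 1 21\ne 2 1 21\n"
    for g in parse_graphs(sample):
        print(g.graph_id, g.vertices, g.edges)
```

Edges are stored as plain tuples because the format does not state a direction. An algorithm that treats the graphs as undirected would have to consider both (P, Q) and (Q, P).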
Five small datasets
Here are five small datasets that I have created for debugging frequent subgraph mining algorithms. Each dataset contains a single graph, which is enough for some small debugging tasks.
Content of the file:
t # 1
v 0 10
v 1 11
v 2 12
e 0 1 21
e 2 1 21
Visual representation:
(L10) --- L21 --- (L11) --- L21 --- (L12)
  0                 1                 2
Content of the file:
t # 1
v 0 10
v 1 11
v 2 10
v 3 10
e 0 1 21
e 2 1 21
e 1 3 21
Visual representation:
(L10) --- L21 --- (L11) --- L21 --- (L10)
  0                 1                 2
                    |
                   L21
                    |
                  (L10) 3
Content of the file:
t # 1
v 0 10
v 1 10
v 2 10
e 0 1 20
e 1 2 20
e 2 0 20
Visual representation:
(L10) --- L20 --- (L10) --- L20 --- (L10)
  0                 1                 2
  |                                   |
  +--------------- L20 ---------------+
Content of the file:
t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 1 21
e 0 2 20
e 1 3 20
e 2 3 22
e 1 2 23
Visual representation:
(L10) ------- L20 ------- (L11)
  |                     /   |
  |                   /     |
  |                 /       |
 L21              /         |
  |            L23         L22
  |           /             |
  |         /               |
  |       /                 |
  |     /                   |
(L10) ------- L20 ------- (L11)
Content of the file:
t # 1
v 0 10
v 1 10
v 2 11
v 3 11
e 0 2 20
e 1 3 20
e 1 2 20
Visual representation:
(10) --- 20 --- (11) --- 20 --- (10) --- 20 --- (11)
 0               2               1               3
Two standard benchmark datasets
Moreover, here are two popular datasets that are used in frequent subgraph mining (I obtained them from GitHub):
- Coumpound_422.txt : It contains 422 graphs
- Chemical_340.txt : It contains 340 graphs
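A quick way to check a downloaded file against the graph counts listed above is to count its `t # N` header lines. The sketch below is a hypothetical helper (`count_graphs` is not part of any library). It also assumes that a header with the id -1 is an end-of-file marker rather than a graph, which is a convention used by some gSpan implementations and may not apply to these files.

```python
def count_graphs(lines):
    """Count the 't # N' header lines in an iterable of text lines."""
    count = 0
    for line in lines:
        parts = line.split()
        is_header = len(parts) >= 3 and parts[0] == "t" and parts[1] == "#"
        # Assumption: 't # -1' marks the end of the file and is not a graph.
        if is_header and parts[2] != "-1":
            count += 1
    return count


if __name__ == "__main__":
    sample = ["t # 0", "v 0 10", "t # 1", "v 0 10", "v 1 11", "e 0 1 20", "t # -1"]
    print(count_graphs(sample))
```

Because an open text file is an iterable of lines, the same function can be applied to a file directly, for example `with open("Chemical_340.txt") as f: print(count_graphs(f))`. If the file matches the description above, the result should be 340.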
Conclusion
In this blog post, I have shared some helpful datasets. If you want to know more about subgraph mining, you may read my short introduction to subgraph mining.
—
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.
Hi Mr Philippe,
Thank you for taking the time to write this post. I am a Master's student working on subgraph mining. Do you know where I can find a graph dataset (with multiple graphs in the dataset) that has weights on the edges? I have searched the internet a lot, but I only find weighted datasets that contain a single graph. I would appreciate it if you could answer my question. Thank you for your time and response.
Best,
Thinh
Hello, I do not have such a dataset and have not searched for one, so I cannot tell you where to find it without searching myself, just as you would. It is quite possible that no such dataset is publicly available. In that case, you have a few options: (1) contact the authors of papers who used such datasets and ask them for their data, (2) make your own dataset by converting some public data into the form you need or by collecting your own data, or (3) use a synthetic dataset. For the last option, you could, for example, take some ordinary graph datasets and generate the weights randomly (although this is not as good as having a real dataset).
By the way, if you find some datasets, you can share them, and I can add them to this page.
Best,
Philippe
Hi Thinh and Philippe,
I have uploaded nine new large graph datasets for frequent subgraph mining to the link:
https://github.com/nphdang/gSpan/tree/master/Data
These datasets have edge labels which can be used as unnormalized edge weights.
For more information, please refer to this post in the data mining forum:
http://forum.ai-directory.com/read.php?5,5250
Cheers,
Dang Nguyen
Thank you Sir,