SPMF Upcoming feature: Graph viewer

Today, I will give you a preview of another upcoming feature of SPMF, which will be released in the next version of SPMF (2.59). It is the Graph Viewer tool.

The Graph Viewer is a simple tool for visualizing graphs. The Graph Viewer is designed to display graphs that can be directed or undirected, and have labels. The Graph Viewer can also automatically choose an appropriate layout for visualizing a graph.

Why a Graph Viewer in SPMF? It will be used to allow users to visualize input files containing graphs and output files containing frequent subgraphs. This is useful to visualize the inpu files of frequent subgraph mining algorithms such as gSpan, cgSpan and TKG, as well as the patterns that are discovered by these algorithms (frequent subgraphs).

I have completely implemented the Graph Viewer in Java, without using external libraries so as to avoid dependencies and to make it as lightweight and fast as possible, a long-time design goal of SPMF. In fact, unlike many other data mining libraries and open-source projects, SPMF do not have any external dependencies and the code is well optimized. This ensure the stability of the project and avoid problems that could arise from relying on external libraries.

Let me now show you the current features of the Graph Viewer, which may still be updated or improved in the final release.

Opening a graph file

The first feature is to open an input file containing one or more graphs. This is done by selecting the Graph Viewer tool:

Then, let’s say that we open the example file contextTKG.txt offered in SPMF, which contains three graphs. The Graph viewer will display graphs in a window like this:

Here we see the third graph from the file. At the bottom, there are two buttons < > for navigating to the previous or the next graph. In the above picture, the third graph is shown (Graph 3 of 3). This graph has ID 3, and contains 4 nodes and 4 edges, as indicated in the bottom part of the Window. The nodes are displayed with a text of the form x:y where x is the node ID and y is the node label. Edges are displayed in blue color with their labels.

To display the graph in a pleasant way, I have implemented a forced directed graph layout algorithm, which is the Fruchterman/Reingold (1991) algorithm. It automatically places the nodes in an appropriate location so that the graph can be displayed in a beautiful way.

Opening a pattern file

We can also use the graph viewer tool to display the frequent subgraphs found by an algorithm such as TKG. For example, here I apply the TKG algorithm and select the “Graph Viewer” tool to open the result file.

The result is 16 frequent subgraphs, which are displayed by the Graph Viewer as follows:

In the above picture, we see the frequent subgraph 9. We can use the buttons <> to move to the previous or next frequent subgraph, and thus view all of the 18 subgraphs that have been found. The support of each subgraph is displayed.

Moving the graph nodes with the mouse

Another feature of the Graph Viewer is that we can move the nodes with the mouse by dragging them over the panel:

Running the graph viewer from the command line

It will be also be possible to call the Graph Viewer from the command line, just like almost all algorithms and tools from SPMF. For example, if we put the spmf.jar file in the same folder as the file contextTKG.txt, we can apply this command:

java -jar spmf.jar run Open_graph_database_file_with_graph_viewer contextTKG.txt

Then, this will start the Graph Viewer to display the file:

Displaying other types of graphs

The Graph Viewer is designed in a quite general way so that it could also display other types of graphs and be used for other functions in SPMF in the future. For example, below, I show an example graph that is created programmatically rather than by reading a file.

I use this example to show the display of directed and undirected edges. Also, we can also see that the automatic layout algorithm works quite well and display the graph in a proper way. Here is another example:

In the Java code, we can also change how the nodes are displayed. I did not offer this option in the user interface as I think it is less important though. What do you think?

Update: Choosing different types of graph layout

I had one hour of free time this morning, so I decided to add an option to choose different types of graph layout algorithm. For example, here we see three types of layout:

1) Using the Fruchterman/Reingold (1991) algorithm:

2) Using a random layout:

3) Using a grid layout:

4) Using a circle layout:

I might add more graph layout algorithms later. I think that these algorithms are quite interesting.

Update 2: a few more features

I have fixed some bugs and added a few more improvements. There is now a panel which can show the textual representations of graphs that are displayed (right side on picture below)). There is also a new button to save a graph visualization to PNG. Moreover, there is a button to resize the canvas so as to be able to show larger graphs.

Conclusion

Hope that this blog post has been interesting. My goal was to show you some upcoming feature, which I think will be useful for those working on frequent subgraph mining. If you have some suggestions to improve this tool, you may let me know in the comments below. I will consider them. Also, I might still improve this tool before it is released.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Posted in Data Mining, Data science, open-source, spmf | Tagged , , , , , , | Leave a comment

SPMF upcoming feature: Algorithm Explorer

Today, I want to show you an upcoming feature of SPMF, which is a new tool called the Algorithm Explorer to explore the list of algorithms and tools offered in SPMF. It is actually a simple tool, but I think it can be useful, as there are many algorithms.

Note that this is a preview of the tool, as it will released in the next version of SPMF (2.59).

To open the new Algorithm Explorer, in the GUI of SPMF, we have to choose:

Then, this will open the new tool called Algorithm Explorer, where algorithms are shown as a tree on the left, classified by categories, and we can see information about the selected algorithm on the right:

In the above picture, we selected the AFEM-Rules algorithm. Thus, we can see the name of the algorithm, the category, the authors of the implementation, the link to the example page, the input and output types as well as the parameters.

Searching for similar algorithms

Update: I also added two buttons that allows to search for algorithms that are similar to a selected algorithm. More precisely, we can search for (1) algorithms with the same input, output and mandatory parameters, and (2) algorithms with the same input and output but that may not have the same parameters.

For example, if we select the RuleGrowth algorithm for sequential rule mining on the left and click on this button:

It will highlight all algorithms that have the same input and output types as RuleGrowth:

And if we instead click on this button:

It will highlight the algorithms that not only have the same input and output types as RuleGrowth but also the same mandatory parameters:

In this case, we can notice that TRuleGrowth is not included anymore because although it has the same input type and output type as RuleGrowth, it has an extra parameter that is the window length.

Let me show you one more example. Let’s say that I choose a high utility itemset mining algorithm like EFIM. Then, I can quickly find that many algorithms have the same input and output types and also the same mandatory parameters:

Conclusion

That is the current version of this tool. I will think about other potential improvements. If you have any suggestions, you may tell me in the comments below, either to add more functions or improve the user interface. I will try to take them into account, when I have time.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.


Posted in open-source, spmf | Tagged , , , | 1 Comment

Brief report about IEEE ICDM 2022

In this blog post, I will give a brief report about the IEEE ICDM 2022 (International Conference on Data mining), which I have attended. It started on November 28th 2022. I have attended ICDM as a workshop organizer this year and as a co-author.

What is ICDM?

ICDM is a good conference for data mining and machine learning research. I have attended it for several years. You can see for example my reports about ICDM 2021 and ICDM 2020 on this blog. The proceedings are published by IEEE, and the conference was held offline with some online participants (for those who could not attend in person due to the pandemic).

Proceedings and acceptance rate

The proceedings were made available online to authors on the first day of the conference.

For the main proceedings, at ICDM 2022, 885 submissions were received from 54 countries. All papers were reviewed using a triple-blind process. From this, 85 regular papers and 89 short papers were accepted. Thus, the full paper acceptance rate is 9.77% and the overall acceptance rate is 20%.

For the workshop proceedings at ICDM 2022, there were 326 submissions in 15 workshops. And 135 papers (41%) were accepted.

Day 1: Workshops, but some problems

The first day was dedicated to workshops. There were many workshops on a variety of emerging topics. In particular, I co-organized the UDML workshop on utility-driven mining and learning, which is the 5th edition this year, and has been quite successful this year with about 20 submissions, 7 accepted papers, and over 20 participants.

However, on Day 1, I faced problems with the online platform that was adopted for the conference. It is Zoom Events, which can be viewed as a special version of Zoom designed to host events such as conferences with more functionalities than the basic Zoom platform. Zoom events provide a message board and a schedule of all events as well as the possibility to chat with other participants. The problems that I and other participants encountered are as follow.

First, our workshop was supposed to start at 10:00 AM but there was no “start” button to allow us to start our workshop. Thus, hours before, I sent several messages to the organizers on the platform and by e-mail, but we got no answer before the start time of our workshop. There were also a few other workshop organizers that had the same problem and were talking about how to solve the problem in the chat. As we did not receive any answer, for our workshop we decided to create a public Zoom link and send it to authors by e-mail to start the workshop on time. Then, at around 10:40, I received an e-mail telling me that I could start our workshop but we had started it already using our Zoom link, 40 minutes earlier…

Second, multiple authors from our workshop did not receive instructions about how to login to the conference before our workshop started. From 7 papers in our workshop, this affected 5 papers. So before the workshop, I receive multiple messages from authors about how to login. In one case, there was also an author who could not login because he used a different e-mail for his zoom account than for registering to the conference. And all these problems were not solved before our workshop start time. So this is another reason why we had to use a public Zoom link for our workshop.

From these problems, I think there are two lessons to learn: (1) It is better to do a rehearsal in advance with workshop organizers if a new system is used (like it was done for ICDM in Singapore in ICDM 2020, for example) to avoid issues, and (2) it would have been much simpler to just use Zoom public links instead of using Zoom events, because this latter creates problems for logging due to the restriction on e-mails who can log in.

Day 2 – Opening ceremony and …

On the second day, I wanted to attend the opening ceremony and check the keynote talks as this is usually very interesting. However, I did not receive any Zoom Events link for accessing the second day of the conference in my e-mail. So it seems that only the first day of the conference is online. That is unfortunate but I can understand as it is supposed to be an in person conference. So that will be the end of this post about ICDM 2022.

Conclusion

This was a blog post about the ICDM 2022 conference. Hope it has been interesting. Will be looking forward to next year’s ICDM 2023 conference.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Big data, Conference, Data Mining | Tagged , , , , , , | 5 Comments

How to package SPMF as an EXE file with JPackage?

In this blog post, I will explain how to use the jpackage tool that is provided with Java to (1) package the JAR file of SPMF into an EXE file for Windows, and (2) to create an installer for SPMF.

Requirements

It is necessary to have:

  • A computer with Windows (as I will give instructions for this platform, but you may do something similar on other operating system)
  • A recent version of the Java JDK (at least version 14) so that you have the jpackage command available.
  • The JAR file of SPMF. You can download it here: spmf.jar or from the website of SPMF.
  • The ICO file of SPMF (if you want to make an application that has an icon): SPMF.ico

How to create an EXE application of SPMF for Windows

Step 1. On a windows computer, create two folders /input/ and /output/ on the desktop.

Step 2. Put the files spmf.jar and SPMF.ico in the folder /input/ that you have just created.

Step 3. Open the command line of Windows and execute this command:

jpackage --input C:\Users\philippe\Desktop\input\ --dest C:\Users\philippe\Desktop\output\ -n “SPMF” --main-jar spmf.jar --main-class ca.pfv.spmf.gui.Main --type app-image --icon C:\Users\philippe\Desktop\input\SPMF.ico

where C:\Users\philippe\Desktop\ should be replaced by the path to your desktop on your computer.

After executing this command, an EXE file will have been created in the output directory \output\

Step 4. To launch the software, you can now click on SPMF.exe.

How to create an installer for SPMF for Windows?

You can follow the same steps as above but use this command:

jpackage --input C:\Users\philippe\Desktop\input\ --dest C:\Users\philippe\Desktop\output5\ -n “SPMF” --main-jar spmf.jar --main-class ca.pfv.spmf.gui.Main --type exe --icon C:\Users\philippe\Desktop\input\SPMF.ico --win-dir-chooser --win-menu --win-shortcut-prompt

Then, this will create an installer, which looks like this:

This is the installation process on Windows 11:

And those are the files after installation:

Customization and generating installers for other platforms

There are also many other options offered by jpackage, including generating packages for other platforms. For more information see the documentation of the jpackage command.

Conclusion

This was just a short blog post to show how to package SPMF.jar into a native application. I think the process of using jpackage is quite simple. In the past, I had used some other commercial tools to create EXE files for Java programs but the process was more complicated. I am thus happy to have found jpackage.


Philippe Fournier-Viger is a distinguished professor of computer science

Posted in Java, open-source, spmf | Tagged , , , | Leave a comment

New version of SPMF (2.58)

This is to announce that a new version of SPMF has been released on the 27th November 2022. This version has 7 new pattern mining algorithms:

  • the HUCI-Miner algorithm to mine closed high utility itemsets and generators at the same time (thanks to Jayakrushna Sahoo et al. for the original code )
  • the FHIM algorithm to mine all high utility itemsets (thanks to Jayakrushna Sahoo et al. for the original code)
  • the HGB algorithm to mine non redundant high utility association rules (thanks to Jayakrushna Sahoo et al. for the original code)
  • the HGB-all algorithm to derive all high utility association rules from the non redundant high utility association rules (thanks to Jayakrushna Sahoo et al. for the original code)
  • algorithms for mining sequential patterns with flexible constraints in a time-extended sequence database (eg. MOOC data)
    • the SPM-FC-L algorithm fi (Thanks to Wei Song et al. for the original code)
      • the SPM-FC-P algorithm (Thanks to Wei Song et al. for the original code)

Besides, it has several new features such as:

(1) An integrated text editor to open output file (to give an alternative to the system’s default text editor)

(2) Some improvements to the graphical user interface, such as shown below, such as colors to highlight algorithm categories and a window icon:

And some bugs have been fixed.

Besides a new MOOC.txt dataset of sequences of courses with timestamps has been added to the dataset page of SPMF.

Thanks again to all users and contributors to SPMF!


Philippe Fournier-Viger is a distinguished professor of computer science

Posted in Data Mining, Data science, open-source, spmf | Tagged , , , , , , | Leave a comment

Brief report about the MEDI 2022 conference

In this blog post, I will talk about the 11th International Conference on Model & Data Engineering (MEDI 2022), which I have attended this week. It was held from the 21st to 23rd November 2022, in Cairo, Egypt, and also online.

MEDI 2022 conference

MEDI is a good conference for modelling, database, and related topics. MEDI is ranked C in the CORE 2021 ranking. It has been held in various countries over the years such as: Egypt, Estonia, France, Morocco, Spain, Cyprus, Italy and Portugal.

The local organization of the 2022 edition of MEDI is by the Nile University.

Nile University

This year, I am glad to play the role of Program Committee (PC) Chair at this conference. Below, I give an overview of the event.

Opening ceremony

The opening ceremony was at 10:00 AM. The conference was introduced, and an overview of the program was given. Below, I present some slides from the opening ceremony with some more details.

Countries were MEDI was held:

MEDI conference by countries

Organizations. Here is an overview of the committees behind MEDI 2022:

MEDI Conference committees

Paper selection. About the program, this year 65 papers have been submitted to the main conference, and from that 18 were accepted for long presentation (28%) and 11 for short presentation (17%). The program committee consisted of 60 researchers from 23 countries, which provided 200 reviews, for an average of 3.5 reviews per paper.

MEDI 2022 statistics

Proceedings. Full presentation papers are published in a Springer LNCS volume, while short presentation papers are published together with workshop papers in a Springer CCIS volume.

MEDI 2022 conference proceedings

Totally, there was 190 authors and submissions were made from 18 countries on five main topics: (1) Modelling, (2) Image processing and diagnosis, (3) Machine learning and optimization, (4) Natural language processing and (5) Database systems.

Workshop. A workshop called DETECT was co-organized with MEDI 2022, and 4 papers were accepted in that workshop (44%).

Special issues. Two special issues are organized for best papers of MEDI 2022

MEDI 2022 special issues

Keynote talks

There was three keynote talks.

The first keynote talk was by Prof. Vincent S. Tseng from National Yang Ming Chiao Tung University about “Broad and Deep Learning of Big Heterogeneous Health Data for Medical AI: Opportunities and Challenges“.

Vincent S. Tseng keynote talk

The second talk was “A service-based approach to drone service delivery in Skyway networks” by Prof. Athman Bouguettaya from The University of Sydney, Australia

Athman Bouguettaya keynote talk

The third talk was “Safety and security are key considerations in the design of critical systems” by Dr. Colin Snook from University of Southampton, United Kingdom.

Best paper awards of MEDI 2022

The best paper awards were announced during the closing ceremony. These papers were selected based on the peer reviews and also based on the presentations given at the MEDI conference.

Conclusion

This is all for this blog post! Hope this has been interesting. Looking forward to MEDI 2023!
In about 1 week, I will talk to you about ICDM 2022.


Philippe Fournier-Viger is a distinguished professor of computer science

Posted in Conference, Database | Tagged , , , , , | Leave a comment

Brieft report about the MIWAI 2022 conference

In this blog post, I will talk about the 15th International Conference on Multi-disciplinary Trends in Artificial Intelligence (MIWAI 2022), which was held as a virtual event on November 18th, 2022. I have attended this conference to present two papers related to pattern mining.

MIWAI is a conference that has been held every year in the pacific/Asia region, for 15 years. In the past, MIWAI has been held in countries such as Vietnam, India, Thailand, China, Malaysia and Brunei. I have also attended MIWAI in Malaysia in 2019 (see my blog post about MIWAI 2019 here).

Proceedings of MIWAI 2022

This year, the MIWAI conference received 42 papers from which 19 were accepted, which includes 14 full papers (acceptance rate of 33% for full papers) and 5 short papers (acceptance rate of 45% for full + short papers).

The proceedings of the conference are published by Springer in a book from the LNAI series. Hence, the papers are well-indexed in various publication databases such as DBLP, which is good.

Schedule of the conference

The conference was organized during a single day (November 18) and online using the Zoom platform. It started with an opening session, followed by a keynote talk by Prof. Rapeepan Pitakaso from Thailand about “Artificial Multiple Intelligence System (AMIS) and its Applications”. Then, there was paper presentations organized as parallel sessions during the rest of the day.

Opening

During the opening, the organizers talked about the program, the review process, the organization, etc. It was nice to see several people that I knew already from MIWAI 2019. The organizers are very friendly and professional.

It was announced that next year, MIWAI 2023 will be in Hyderabad, India. The submission deadline is planned for 10th January 2023.

Then, MIWAI 2024 will be in Chongqing, China.

Awards were also announced. I am please that the best paper award was given to “LCIM: Mining Low Cost High Utility Itemsets“, which is my paper. There was also another paper who received an award.

Keynote talk

There was a keynote talk “AMIS – Artificial Multiple Intelligence System: Theory and Application” by Prof. Rapeepan Pitakaso from Ubon Ratchathani Unviersity, Thailand. The talk was about a system called Artificial Multiple Intelligent System (AMIS), that aims to combine multiple types of intelligence, just like humans do (verbal, linguistic intelligences, etc.). The system is based on heuristic algorithms and also combines CNN (Convolutional Neural Networks) and other techniques. Here are some pictures and slides from this presentation.

Papers on pattern mining

As readers of this blog knows, I am interested in the field of pattern mining. At the conference, there was two papers about pattern mining (my papers):

5121Philippe Fournier-Viger, M. Saqib Nawaz, Yulin He, Youxi Wu, Farid Nouioua and Unil YunMaxFEM: Mining Maximal Frequent Episodes in Complex Event Sequences [paper]
[source code] [ppt]
6226M. Saqib Nawaz, Philippe Fournier-Viger, Naji Alhusaini, Yulin He, Youxi Wu and Debdatta BhattacharyaLCIM: Mining Low Cost High Utility Itemsets [paper]
 [ppt][source code and data]

These papers are about episode mining and high utility pattern mining.

Affordable registration fee

A good thing about this conference is that the registration fee is cheap. For one paper, it costs 250$ USD and for two papers, I spent only 350$ USD. This is much cheaper than many other conferences published by Springer and also IEEE. For example, an alternative that I considered would have been to publish in some European conferences published by Springer such as DAWAK or DEXA but registering two papers for those conferences would have cost me over 1240 euros instead of 350$ USD! This is a big difference. There are also many IEEE top conferences that are very expensive and have an increasing price in recent years. For example, this year, IEEE ICDM has a registration price of 1300 USD, while the price 10 years ago was about only 500$ USD. Thus, I appreciate that MIWAI offers registration at a very reasonable price. I think it can allow researchers from all around the world to more easily publish their papers, especially when research funding is limited.

Conclusion

MIWAI 2022 was a good conference. It is not a very large conference but there are some interesting papers of good quality, and the conference is well-organized. I will try to attend again for MIWAI 2023.
Later this month, I will talk to you about the MEDI 2022 and ICDM 2022 conference.


Philippe Fournier-Viger is a distinguished professor of computer science

Posted in artificial intelligence, Conference, Data Mining | Tagged , , , | Leave a comment

How to change the language of Tableau Desktop from the registry?

This is a short post about how to change the language of Tableau if the option is not available in the Help Menu.

  1. Close Tableau
  2. On Windows, open Regedit
  3. Search for “Language code” to find the keys containing the languages options of Tableau.
  4. Change the three following keys: “Language code”, “Repository language” and “SamplesLanguage” to your preferred language. For example “en_US” is for English and “zh_CN” is for Chinese.

5. Start Tableau again and the language will have been changed.

Posted in Data science | Tagged , , | Leave a comment

An integrated text editor in SPMF

Today, I will talk about some upcoming feature for the next release of SPMF (2.58), which is a simple integrated text editor. This new release should happen in about 1 or 2 weeks (as I am very busy recently) and will contain some new algorithms. But also, it will contain an integrated text editor. This may seem very strange? Why a text editor? I will explain briefly the idea in this blog post to give an overview of this feature as I am testing it now. If you have comments, you may leave them in the comment section below

Why a text editor in SPMF?

Well, in previous version of SPMF, there was a few options to open the output files produced by the algorithms: using the system’s default text editor (e.g. Notepad on Windows) or using the Pattern Viewer tool of SPMF. Although using the system’s text editor can be good, I was thinking that having a customized text editor could be interesting and it is actually not difficult to implement. So, I will explain how it works below.

The SPMF text editor

In the next release of SPMF, there will be a new option to open the output file using the SPMF text editor instead of the default system text editor.

The SPMF text editor looks like this:

It has some useful features such as showing the line count and the line and column numbers in the status bar at the bottom. This bar can also be hidden.

Also a useful feature is to always highlight the current line (which is not done by NotePad, for example):

Besides, it is possible to activate a “night mode” from the menu:

And there is also a search bar that is very convenient for highlighting some keywords in a file, and works like the search bar in some Web browsers:

Also, there is of course the possibility of changing the font and the font size.

Besides, all the user-defined preferences are saved. So next time that you open SPMF, the same font, font size, night mode preference, window size and location, and other preferences of the SPMF text editor are kept.

The SPMF text editor can also display the size of the last file that is opened:

And here is an overview of the menus:

It has the basic important functions such as “Line wrap” and “Word wrap”.

Conclusion

Interesting? You will be able to try it in the next version of SPMF to be released soon.

For now, the features are quite simple because the aim is to provide an alternative to the system text editor specialized for opening the output files of SPMF. It is not designed to compete with a more complex text editor and to be used as a general-purpose text editor (although it could).

A limitation of the SPMF text editor is that text files are loaded in memory so it is restricted to opening files that are not too big.

Many other features could be added like highlighting keywords in the output file. So there is something to think about. Which features would be useful? You may let me know what you think. If some features are not too hard to implement and useful, I may add them. I will also do some more debugging before it can be released.

If you have any suggestions or ideas, you can let me know in the comment section or by e-mail at philfv AT qq.com

Posted in Data Mining, open-source, spmf | Tagged , , , | Leave a comment

频繁子图挖掘算法介绍

This post is a Chinese introduction to frequent subgraph mining (English here).

在这篇博文中,我将介绍一种有趣的数据挖掘算法,叫频繁子图挖掘,它主要用来在图中挖掘有用模式。这一算法非常重要,因为数据在很多领域自然地以图来表示(比如社交网络、化学分子、国家路网)。因此,通过分析图形数据来发现有趣、意外、有用的模式是有必要的,它们可以用来帮助理解数据或者做决策。

什么是图?一点理论

在讨论图分析之前,我们先介绍几个定义。是一组有一些标签顶点。用一个例子来说明一下。参考下图:

graph

这个图包含四个顶点(描绘成黄色的圆圈)。这些顶点都有“10”、“11”等标签。这些标签提供了有关顶点的信息。例如,把这张图想象成一个化学分子。标签10和11可以分别代表氢和氧这两种化学元素。标签不需要是唯一的。换句话说,同一个标签可以用来描述同一个图中的几个顶点。例如,如果上图表示化学分子,则标签”10″和”11″可分别用于代表氧和氢的所有顶点。

除了顶点,图中也包含。边是顶点之间的连线,这里用粗黑线表示。边也有一些标签。在本例中,使用了四个标签,即”20″、”21″、”22″和”23″。这些标签代表了顶点之间不同类型的关系。边的标签不唯一。

图的类型:连通图和非连通图

现实生活中可以找到许多不同类型的图。图分为连通图非连通图。让我们用一个例子来解释一下。参考以下两个图:

connected graph and disconnected graph

左边的图称为连通图,因为沿着边可以从任何顶点到其他顶点。例如,想象一下顶点代表着城市,边是城市之间的道路。这是一个可以沿着道路从一个城市到任何其他城市的连通图。如果图没有连通,则称它是一个非连通图。例如,右边的图是断开的,因为不能沿着边从其他顶点到达顶点A。在下面的文章中,我们将使用术语“图”来表示连通图。因此,我们下面讨论的所有图都是连通图。

图的类型:有向图和无向图

区分有向图和无向图也是很有用的。在无向图中,边是双向的,而在有向图中,边可以是单向的也可以是双向的。让我们用一个例子来说明一下。

directed and undirected graphs

左边是无向图,而右边是有向图。在现实生活中有向图的例子有哪些呢?例如,考虑一个图, 它的顶点表示位置,边为道路。有些道路可以双向行驶,而有些道路只能单向行驶(城市中的“单行道”)。

一些数据挖掘算法被设计成只处理无向图、有向图或两者都支持。

分析图

我们已经介绍了一些关于图的理论,那么我们可以做什么样的数据挖掘任务来分析图?这个问题有很多答案。答案取决于我们的目标是什么,也取决于我们正在分析的图的类型(有向图/无向图、连通图/非连通图、单个图或多个图)。

在这篇博文中,我将阐述一个被广泛研究的挖掘任务,称为频繁子图挖掘。子图挖掘的目的是在一组图(图形数据库)中发现令人感兴趣的子图。但我们如何判断一个子图是否令人感兴趣呢?这取决于应用场景。兴趣度可以用各种方式来定义。传统上,如果一个子图在一组图中出现多次,它就被认为是令人感兴趣的。换句话说,我们希望发现多个图共有的子图。例如,找出几种化学分子共有的化学元素, 这种类型的联系是很有用的。

在一组图中查找频繁子图的做法称为频繁子图挖掘。作为输入者,用户必须提供:
图数据库(一组图)
▪一个称为最小支持阈值的参数(minsup)。

然后,频繁子图挖掘算法将枚举输出所有的频繁子图。频繁子图是至少在图数据库中出现minsup次的子图。下面,让我们看一下包含以下三个图的图数据库:

a graph database

现在,假设我们要发现至少出现在三个图中的所有子图。因此,我们将把最小参数设置为3。通过应用频繁子图挖掘算法,我们将得到至少出现在三个图中的所有子图的集合:

frequent subgraphs

参考一下第三个子图(“频繁子图3”)。这个子图是频繁的,有3个支持度(频度),因为它出现在三个输入图中。这些出现以红色标出,如下:

occurrences of frequent subgraphs

现在一个重要的问题是如何设置minsup参数?在实际应用中,一般是通过试错法来确定参数。如果此参数设置得太高,则会找到很少的子图,而如果设置得太低,则会根据输入数据库找到数百万的子图。

现在,在实践中,哪些工具或算法可以用来查找频繁的子图?有各种频繁子图挖掘算法。其中最著名的是GASTON, FSG和GSPAN。

在单个图中挖掘频繁子图

除了发现几个图共有的子图外,频繁子图挖掘的问题也有一些变体,它包括在单个图中查找所有频繁子图而不是在图数据库中查找。方法几乎是一样的。目的也是发现频繁出现或令人感兴趣的子图。唯一的区别是支持度(频度)是如何计算的。对于这个变化,子图的支持度是它在单个输入图中出现的次数。例如,参考以下输入图:

a single large graph
frequent subgraphs

这个图包含七个顶点和六个边。如果我们通过将minsup参数设置为2,在这个单图上执行频繁子图挖掘,可以发现以下五个频繁的子图:

这些子图是频繁的,因为它们在输入图中至少出现过两次。例如,参考“频繁子图5”。这个子图有2个支持,因为它在输入图中有两次出现。这两种情况分别用红色和蓝色突出显示在下面。

在图数据库中发现模式的算法通常可以用来发现单个图中的模式。

结论

在这篇博文中,我们介绍了频繁子图挖掘的问题,包括发现在一组图中频繁出现的子图。这个数据挖掘问题已经被研究了15年之多,并提出了许多算法。有些算法是精确算法(会找到正确答案),而有些则是近似算法(不保证找到正确答案,但可能更快)。

一些算法也被设计来处理有向图或无向图,或者在单个图或图数据库中挖掘子图,或者同时做这两种。此外,子图挖掘问题还有其他几种变体,如在图中发现频繁路径,或在图中发现频繁树

此外,在一般数据挖掘中,还研究了与图有关的许多其他问题,如优化问题、社交网络中的社群检测、关系分类等。

对比其它类型的数据, 一般来说,与图有关的问题是相当复杂的。子图挖掘困难的原因之一是算法通常需要检查“子图同构”,即比较子图以确定它们是否等价。尽管如此,我认为这些问题是相当有趣的,因为有一些研究上的挑战。

希望你喜欢这篇博文。如果对这个话题感兴趣,我将来可能会在图挖掘上再写一篇博文。


Philippe Fournier-Viger是计算机科学教授,也是提供了200多种数据挖掘算法的开源数据挖掘软件SPMF的创始人,。

Posted in Chinese posts, Data Mining | Tagged , , | Leave a comment