The story of the most influential paper award of PAKDD 2024

Recently, I attended the PAKDD 2024 conference, where I was happy to receive the most influential paper award with my co-authors. This is a test-of-time type of award, given to the paper from PAKDD 2014 that received the largest number of citations or had the largest impact over the last ten years. In this blog post, I will briefly tell the story of this paper and why it has been successful. Then, I will talk about some applications of the algorithms presented in the paper, and about how to get such an award.

The paper

I have received the award with my co-authors for this paper:

Fournier-Viger, P., Gomariz, A., Campos, M., Thomas, R. (2014). Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information. Proc. 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2014), Part 1, Springer, LNAI 8443, pp. 40-52. [ppt][source code]

This paper is about sequential pattern mining (SPM). Let me explain quickly what this is. SPM is a popular data mining task that is used to analyze sequence data. So what is a sequence? A sequence is an ordered list of symbols. Here are a few examples of sequences that we may find in real-life applications.

In this first example, we have a sequence of customer purchases indicating that a customer has bought an apple, then some bread, and then some cake. But sequences can also be found in many other domains. Another example is a sequence of words in a text:

In that sequence, the words are sequentially ordered. Yet another example, from a different domain, is a sequence of locations visited by a person driving a car in a city:

If we have data represented as sequences, we can apply the task of sequential pattern mining to find patterns (subsequences) that appear frequently in those sequences. The idea is to discover patterns that could reveal something about those sequences. For example, we may want to analyze the sequences of purchases made by several customers to find some sequences of purchases common to multiple customers. Let me show you how sequential pattern mining works with a simple example. Consider a database of four sequences representing the purchases made by four different customers:

In that example, the letters a, b, c, and d represent the purchase of apple, bread, cake, and dates, respectively. The first sequence, called S_1, indicates that the first customer has bought apple and bread at the same time, followed by buying cake, and then by purchasing apple. The other three sequences (S_2, S_3 and S_4) have a similar meaning.

Now, if we want to do sequential pattern mining with this data, we must set a parameter called the minimum support threshold (abbreviated as minsup). Consider that we decide to set this parameter as minsup = 3. It means that we want to find all the sequential patterns (subsequences) that appear in at least 3 sequences of the input sequence database.

Let me show you what is the output for minsup = 3:

For the input sequence database on the left and minsup = 3, the output is the list of sequential patterns presented on the right side of the above figure. Take the pattern <{a,b},{c}> as an example. This pattern is said to have a support of 3 because it occurs in three sequences of the input database, as highlighted in yellow below:

As can be seen in the above figure, apple and bread appear together and are followed by cake in the sequences S_1, S_2 and S_3. Hence, the pattern <{a,b},{c}> is said to have a support of 3, and since this value is no less than minsup, <{a,b},{c}> is also said to be a frequent sequential pattern and is output.

It can be observed in the sequence S_2 that there can be a gap between {a,b} and {c}, and this is allowed, since {a,b} still appears before {c}.

So to summarize, the task of sequential pattern mining is to find all the frequent sequential patterns in a database, given some minsup threshold set by the user. And a frequent sequential pattern is a subsequence that appears in at least minsup sequences.

In the above example, with minsup = 3, the output is 6 sequential patterns. For a given sequence database and minsup value, there is always exactly one solution, which is the set of patterns to be discovered. The challenge is to design efficient algorithms to find this solution.
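To make these definitions more concrete, here is a small self-contained Java sketch (written for this blog post; it is not code from the paper or from SPMF) that checks whether a sequential pattern occurs in a sequence and counts its support in a toy database. The sequence S_1 follows the example above, while S_2 to S_4 are only illustrative, since the figure with the full example database is not reproduced here.

import java.util.*;

/** Illustrative sketch: counting the support of a sequential pattern. */
public class SupportCount {

	/** Checks if 'pattern' is a subsequence of 'sequence': each itemset of the
	 *  pattern must be included in an itemset of the sequence, at strictly
	 *  increasing positions (gaps are allowed). */
	static boolean contains(List<Set<Character>> sequence, List<Set<Character>> pattern) {
		int pos = 0; // current position in the sequence
		for (Set<Character> itemset : pattern) {
			while (pos < sequence.size() && !sequence.get(pos).containsAll(itemset)) {
				pos++;
			}
			if (pos == sequence.size()) {
				return false; // no remaining itemset contains this part of the pattern
			}
			pos++; // the next pattern itemset must appear strictly after this one
		}
		return true;
	}

	/** The support of a pattern = the number of sequences containing it. */
	static int support(List<List<Set<Character>>> database, List<Set<Character>> pattern) {
		int count = 0;
		for (List<Set<Character>> sequence : database) {
			if (contains(sequence, pattern)) {
				count++;
			}
		}
		return count;
	}

	public static void main(String[] args) {
		// S_1 = <{a,b},{c},{a}> as in the example; S_2 to S_4 are illustrative only
		List<Set<Character>> s1 = List.of(Set.of('a', 'b'), Set.of('c'), Set.of('a'));
		List<Set<Character>> s2 = List.of(Set.of('a', 'b'), Set.of('d'), Set.of('c'));
		List<Set<Character>> s3 = List.of(Set.of('a', 'b'), Set.of('c'), Set.of('d'));
		List<Set<Character>> s4 = List.of(Set.of('b'), Set.of('c'), Set.of('d'));
		List<List<Set<Character>>> database = List.of(s1, s2, s3, s4);

		// The pattern <{a,b},{c}> has support 3 in this toy database
		List<Set<Character>> pattern = List.of(Set.of('a', 'b'), Set.of('c'));
		System.out.println("Support of <{a,b},{c}>: " + support(database, pattern));
	}
}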

So what was the paper about? In that paper, we presented a new optimization called co-occurrence pruning that can considerably speed up sequential pattern mining algorithms, in some cases by up to 10 times. The improved algorithms presented in the paper are called CM-SPAM, CM-SPADE and CM-CLASP, which are improved versions of the classical SPAM, SPADE and CLASP algorithms.
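To give an intuition of the co-occurrence pruning idea, here is a simplified Java sketch (an illustration of the general principle only; the actual CMAP structure in the paper also distinguishes i-extensions from s-extensions and is integrated inside the candidate generation of SPAM, SPADE and CLASP). It precomputes, for each item, the items that follow it in at least minsup sequences, and then uses this map to discard candidate extensions without performing the costly join operation.

import java.util.*;

/** Simplified illustration of co-occurrence pruning (not the paper's exact CMAP). */
public class CooccurrencePruning {

	/** For each item x, the set of items that appear after x (in a later itemset)
	 *  in at least minsup sequences of the database. */
	static Map<Character, Set<Character>> buildCMAP(List<List<Set<Character>>> database, int minsup) {
		Map<Character, Map<Character, Integer>> counts = new HashMap<>();
		for (List<Set<Character>> sequence : database) {
			// count each pair (x, y) at most once per sequence
			Set<String> seenPairs = new HashSet<>();
			for (int i = 0; i < sequence.size(); i++) {
				for (char x : sequence.get(i)) {
					for (int j = i + 1; j < sequence.size(); j++) {
						for (char y : sequence.get(j)) {
							if (seenPairs.add(x + "," + y)) {
								counts.computeIfAbsent(x, k -> new HashMap<>())
								      .merge(y, 1, Integer::sum);
							}
						}
					}
				}
			}
		}
		// keep only the pairs that reach the minimum support
		Map<Character, Set<Character>> cmap = new HashMap<>();
		counts.forEach((x, followers) -> followers.forEach((y, c) -> {
			if (c >= minsup) {
				cmap.computeIfAbsent(x, k -> new HashSet<>()).add(y);
			}
		}));
		return cmap;
	}

	/** Pruning check: a pattern ending with 'lastItem' cannot be frequent when
	 *  extended with 'newItem' if the two items do not co-occur in at least
	 *  minsup sequences, so the costly join for that candidate can be skipped. */
	static boolean canPrune(Map<Character, Set<Character>> cmap, char lastItem, char newItem) {
		return !cmap.getOrDefault(lastItem, Set.of()).contains(newItem);
	}
}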

The paper won the most influential paper award, having received over 340 citations according to Google Scholar, as shown below:

Citations to the paper from 2014 to 2024

These citations are mainly of two types: (1) applications of the improved algorithms (CM-SPAM, CM-SPADE and CM-CLASP) to real-life problems, and (2) papers that have used the proposed optimization to develop other similar algorithms.

The story of this paper

So now, let me explain the story behind this paper by going back to 2013-2014. The paper was written by me and three co-authors, shown below:

At that time, I was a young professor working on pattern mining, and Antonio and Rincy were students, while Manuel was the supervisor of Antonio. But initially, we did not know each other.

From my side, I had started to develop the SPMF open-source pattern mining software in 2008, which is free software in Java offering efficient implementations of many pattern mining algorithms. Then, around 2013, I received several e-mails from Rincy to discuss sequential pattern mining algorithms:

In particular, Rincy wanted to know which sequential pattern mining algorithm was the best. He really wanted to find the answer and did several interesting experiments about this with SPMF. But at that time, we did not have the source code of all the best algorithms to make an exhaustive comparison.

Then, I started to discuss by e-mail with Antonio from Spain, who had just published a paper at PAKDD 2013 about the CLASP algorithm for closed sequential pattern mining. Antonio agreed to share with me the code of many additional algorithms, including GSP, SPADE, SPAM, PrefixSpan, CloSpan and CLASP.

So now, we had many algorithm implementations for comparing sequential pattern mining algorithms. I then continued discussing with Rincy through e-mails:

As I recall, he made an important observation: vertical algorithms such as SPAM generate too many candidates, and the main cost of such algorithms is the join operation that is performed to calculate the support of each candidate. Thus, if we could find a way to reduce the number of candidates, we might be able to speed up the algorithms…

Then, based on that observation, I designed the new co-occurrence pruning optimization that is presented in the paper. The optimization was implemented by me and Antonio in several algorithms, we all participated in writing the paper, and Manuel also helped with the paper. Then, we submitted it to PAKDD 2014… and it got accepted!

At that time, as I remember, the paper was accepted with maybe two accept recommendations and one weak accept. Thus, it was not regarded as the best paper of PAKDD 2014, but 10 years later, it is arguably the paper that had the biggest impact. From this, we may draw the conclusion that reviewers are not always right. 🙂

Why was it successful?

I believe that the main reasons why the paper was successful are the following:

  • The paper is about a fundamental topic (sequential pattern mining) that can have applications in many fields.
  • We showed a clear performance improvement over the state-of-the-art algorithms and compared with many algorithms on several datasets.
  • I promoted the paper by talking about it to other researchers on numerous occasions, by having a website and a blog, and also by mentioning this paper in several of my own papers, including a survey on sequential pattern mining.
  • I published the source code of the algorithms and datasets in the SPMF pattern mining software. Thus, it is easy for anyone to reuse my code, apply it to other domains or extend it (see the small example after this list).
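As an illustration of how easily these implementations can be reused, below is a minimal sketch of how CM-SPAM can be called from Java through the SPMF library, following the pattern of the example files distributed with SPMF. The exact package name, the runAlgorithm signature and whether the minimum support is given as a fraction or an absolute count are assumptions here and should be checked against the documentation of the SPMF version you use.

// Minimal sketch of running CM-SPAM through SPMF (based on the example files
// distributed with the library; the package name and parameter semantics are
// assumptions to verify against the SPMF documentation).
import ca.pfv.spmf.algorithms.sequentialpatterns.spam.AlgoCMSPAM;

public class RunCMSPAM {
	public static void main(String[] args) throws Exception {
		String input = "contextPrefixSpan.txt"; // a sequence database in SPMF format
		String output = "patterns.txt";         // output file for the sequential patterns

		AlgoCMSPAM algo = new AlgoCMSPAM();
		algo.runAlgorithm(input, output, 0.5);  // 0.5 = minimum support threshold
		algo.printStatistics();                 // runtime, memory usage and pattern count
	}
}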

Applications

The algorithms from that paper, and from the SPMF pattern mining software in general, have been used in a wide range of applications from multiple fields. Some are listed in the picture below:

In particular, two representative applications of the sequential pattern mining algorithm CM-SPAM are presented in those two papers written by my team:

Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier. DOI: 10.1016/j.cose.2022.102741

Nawaz, M. S., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier, 158: 106814. DOI: 10.1016/j.compbiomed.2023.106814

In the first paper above, we applied sequential pattern mining to analyze the behavior of malware programs such as computer viruses, worms and trojans. In this case, the data are sequences of API calls made by programs, and we extract sequential patterns to detect (classify) different types of malware. More precisely, the sequential patterns were used as features to train different classifiers, and excellent performance was achieved compared to state-of-the-art approaches.
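To give an idea of how sequential patterns can serve as classification features, here is a simplified Java sketch of the general methodology (an illustration only, not the exact MalSPM pipeline; the API call names below are hypothetical).

import java.util.*;

/** Simplified sketch: using sequential patterns as binary classification features. */
public class PatternFeatures {

	/** True if 'pattern' occurs as a subsequence of 'sequence' (order preserved, gaps allowed). */
	static boolean occursIn(List<String> sequence, List<String> pattern) {
		int i = 0;
		for (String call : sequence) {
			if (i < pattern.size() && call.equals(pattern.get(i))) {
				i++;
			}
		}
		return i == pattern.size();
	}

	/** One binary feature per discovered pattern: 1 if the program's API call sequence
	 *  contains the pattern, 0 otherwise. The resulting vectors can then be fed to
	 *  any standard classifier. */
	static int[] toFeatureVector(List<String> apiCalls, List<List<String>> patterns) {
		int[] features = new int[patterns.size()];
		for (int k = 0; k < patterns.size(); k++) {
			features[k] = occursIn(apiCalls, patterns.get(k)) ? 1 : 0;
		}
		return features;
	}

	public static void main(String[] args) {
		// Hypothetical API call names, for illustration only
		List<String> apiCalls = List.of("OpenFile", "ReadFile", "Connect", "Send", "CloseFile");
		List<List<String>> patterns = List.of(
				List.of("OpenFile", "ReadFile"),
				List.of("Connect", "Send"),
				List.of("Send", "ReadFile"));
		System.out.println(Arrays.toString(toFeatureVector(apiCalls, patterns))); // prints [1, 1, 0]
	}
}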

In the second paper above, a similar methodology based on sequential pattern mining is used, but for analyzing biological viruses. In this case, the sequences are genome sequences. Excellent results are also obtained.

Those two papers are examples, but in fact, sequential pattern mining can be applied in numerous other domains.

How to get such an award?

So, before concluding, how can one get such an award? I provide a summary in the picture below:

Conclusion

I hope that you have enjoyed this blog post! If you have any comments, you may write in the comment section below.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


A brief report about PAKDD 2024

This week, I attended PAKDD 2024 in the city of Taipei. It was a great conference with good keynote speakers, activities and opportunities for learning and networking. In this blog post, I will give a brief overview of the conference and some news about what will happen in the following years.

What is PAKDD?

PAKDD (Pacific-Asia Conference on Knowledge Discovery and Data Mining) is an international conference focused on data mining and, in recent years, also machine learning. PAKDD is the main data mining conference in the Pacific-Asia area. It is a long-standing conference; this year was the 28th edition (PAKDD 2024).

I like PAKDD conferences and have attended them many times. If you are interested, you may also read my previous reports about PAKDD 2014, PAKDD 2015, PAKDD 2017, PAKDD 2018, PAKDD 2019, and PAKDD 2020.

Conference proceedings

As usual, the conference proceedings of PAKDD are published in the Springer LNAI (Lecture Notes in Artificial Intelligence) series. As the number of papers has been increasing over the years, the proceedings are nowadays published as six books:

The proceedings were made available on the conference website and were not given as a book or USB drive as was done in the past. I assume that this is to be environmentally friendly, which is reasonable.

Acceptance rate at PAKDD 2024

This year, there were 720 submissions. From those, 175 papers were accepted, including 133 for oral and 42 for poster presentations. Thus, the overall acceptance rate is about 24%. The papers were evaluated by a program committee consisting of 595 researchers.

Location

This year, the location was the city of Taipei, on the island of Taiwan. It is a nice, modern city, and the conference was held at the Taipei International Convention Center (TICC), which is well located in the center of the city and quite easy to access from the airport. So, it was a good location.

Workshops

At the conference, there were also six workshops on a variety of topics, including Fintech, affective computing, clustering, robust machine learning, temporal analytics, and pattern mining.

And in particular, I have co-organized the UDML 2024 workshop on Utility-Driven Mining and Learning (see my report about UDML 2024 here). At this workshop, we had an excellent keynote speech with Prof. Jian Pei:

The talk was about the role of data valuation in federated learning. A key point was that we cannot expect different actors to collaborate in federated learning if their own interests (e.g. in terms of money) are not taken into account. Some models were described to solve this issue.

Other activities

There was also an industry exhibition with several companies, which was refreshing. Several of the companies were from Taiwan and use machine learning and data science techniques. There was also a company offering cloud services.

Some other interesting activities were the poster session, tutorials and keynote speeches. I talked with several interesting people at the poster session. For the keynote speeches of the main conference, there was a keynote by Ed H. Chi, a researcher from Google DeepMind, about LLMs (Large Language Models). Another keynote was by Prof. Vipin Kumar about environmental data science, and another by Prof. Huan Liu, also about LLMs.

Here is a picture of the poster session:

Social activities

The conference was overall very well organized. There were several social activities to allow researchers to talk with each other. On the first day, there was a welcome reception at the TICC in the evening:

There was also a tour of the National Palace Museum on the evening of the third day, followed by a banquet at the Silk Palace restaurant. Here is a picture from the banquet:

During the banquet, there was a good music performance, featuring music from around the world:

There was also a performance where an artist drew different scenes using sand, such as this one:

Several awards were also announced at the banquet. Here we can see Prof. Vincent S. Tseng receiving the well-deserved Distinguished Service Award:

I also received the most influential paper award with my co-authors for a paper on sequential pattern mining that was published at PAKDD 2014 and received the most citations over the last ten years.

Some other important awards were given as follows:

  • Distinguished Research Contribution Award: Jiawei Han
  • Early Career Research Award: Yu-Feng Li
  • Best Paper Award: Interpreting Pretrained Language Models via Concept Bottlenecks by Zhen Tan, Lu Cheng, Song Wang, Bo Yuan, Jundong Li, Huan Liu
  • Best Student Paper Award: Towards Cost-Efficient Federated Multi-Agent RL with Learnable Aggregation by Yi Zhang, Sen Wang, Zhi Chen, Xuwei Xu, Stano Funiak, Jiajun Liu

It was also announced at the banquet that PAKDD 2025 will be held in Sydney, Australia:

That should be quite exciting. And there were some rumors that PAKDD 2026 might be in Hong Kong.

Conclusion

This was a brief overview of the PAKDD 2024 conference in Taipei. I hope you have enjoyed this blog post. I am looking forward to PAKDD 2025.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Report on the UDML 2024 workshop @ PAKDD 2024

Today was the 6th International Workshop on Utility-Driven Mining and Learning (UDML 2024), held at the PAKDD 2024 conference. The workshop was a success. There were many people in attendance (around 20), which is good considering that PAKDD is not a very large conference and that about six workshops and tutorials were running at the same time.

Keynote speech by Prof. Jian Pei

A highlight of the UDML workshop was the invited talk by Prof. Jian Pei from Duke University. He is a famous researcher in data science who has made many very important contributions to the field. The talk was called “Data valuation in federated learning” and was very interesting. Prof. Pei first introduced the topic of federated learning, a popular topic, and explained that a key issue with many current models is that the monetary aspect is not taken into account. In fact, many researchers assume that different companies or organizations will want to share their data or collaborate to create models using federated learning, but do not consider that actors need a reward to do so, which could, for example, be in monetary form.

To solve this problem, his team proposed models for federated learning that would ensure some form of fairness and other desirable properties. This is just a brief summary of the idea of this talk. Here are a few slides from the presentation:

Research papers

This year, the UDML workshop was competitive, with 23 submissions and 9 papers accepted. The papers were on various topics, including machine learning, but mainly focused on pattern mining (high utility pattern mining, sequential pattern mining, itemset mining, and some applications).

The proceedings of the workshop can be downloaded from this page. Here are a few pictures of some of the speakers:

Best paper award

In the opening ceremony of UDML, it was announced that the best paper award was given to the following paper, which proposes a novel algorithm for finding co-location patterns in spatial data:

A detection of multi-level co-location patterns based on column calculation and DBSCAN clustering
Ting Yang, Lizhen Wang, Lihua Zhou and Hongmei Chen

Congratulations to the winners!

Group photo

At the end of the workshop, some photos were taken with some of the attendees.

Conclusion

That was a brief overview of the UDML 2024 workshop for this year. In a follow-up blog post, I will try to tell you more about the other activities of the PAKDD 2024 conference. PAKDD is quite an interesting conference, especially for meeting researchers from the data mining and machine learning community in the Pacific-Asia area.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Upcoming SPMF features for v.2.62 – More Dataset Stats Tools

Today, I just want to talk to you about some upcoming features of the next SPMF version, which will be called 2.62. One feature that I am currently adding is more tools to calculate statistics about datasets, as you can see in the picture below:

Previously, there were only a few tools of this type, only for sequence databases, graph databases, and transaction databases. In the next version 2.62, there will be such a tool for all the most important dataset types that can be read by SPMF.
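To illustrate what such a statistics tool computes, here is a small Java sketch (written for this post, not the actual SPMF tool) that reads a sequence database in the standard SPMF format, where items are positive integers, -1 marks the end of an itemset and -2 marks the end of a sequence, and prints a few basic statistics.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Illustrative sketch (not the actual SPMF tool): basic statistics about a
 *  sequence database in SPMF format (-1 ends an itemset, -2 ends a sequence). */
public class SequenceDBStats {
	public static void main(String[] args) throws IOException {
		List<String> lines = Files.readAllLines(Paths.get("sequences.txt"));
		int sequenceCount = 0;
		long itemsetCount = 0;
		long itemOccurrences = 0;
		for (String line : lines) {
			if (line.isEmpty() || line.startsWith("#") || line.startsWith("@")) {
				continue; // skip blank lines, comments and metadata
			}
			sequenceCount++;
			for (String token : line.trim().split("\\s+")) {
				if (token.equals("-1")) {
					itemsetCount++;     // end of an itemset
				} else if (!token.equals("-2")) {
					itemOccurrences++;  // a regular item
				}
			}
		}
		System.out.println("Number of sequences: " + sequenceCount);
		System.out.println("Average itemsets per sequence: "
				+ (sequenceCount == 0 ? 0 : (double) itemsetCount / sequenceCount));
		System.out.println("Average items per sequence: "
				+ (sequenceCount == 0 ? 0 : (double) itemOccurrences / sequenceCount));
	}
}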

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


UDML 2024 Workshop program @ PAKDD 2024

The UDML 2024 workshop on Utility-Driven Mining and Learning is coming soon. This year is the 6th edition, and it will be held next week at PAKDD 2024 in Taipei. I am glad to announce that we have a very good program this year, with a keynote speech by Prof. Jian Pei (a world-class computer scientist), as well as 9 papers accepted from 23 submissions:

UDML SESSION 1 (13:40-15:10)

Session chair: Vincent S. Tseng, Philippe Fournier-Viger

13:40: Opening

13:45: Keynote talk: Data valuation in federated learning (Prof. Jian Pei) ***

14:25: Explainability of Highly Associated Fuzzy Churn Patterns
Yu-Chung Wang, Jerry Chun-Wei Lin and Lars Arne Jordanger

14:40: Incremental skyline frequent-utility itemset mining
Xiaojie Zhang, Guoting Chen, Linqi Song and Wensheng Gan

14:55: A detection of multi-level co-location patterns based on column calculation and DBSCAN clustering
Ting Yang, Lizhen Wang, Lihua Zhou and Hongmei Chen

Coffee break (15:10-15:30)

UDML SESSION 2 (15:30-17:00)

Session chair: Philippe Fournier-Viger

15:30: An Effective Correlated High Utility Itemset Mining Algorithm
Priscilla Okai Owiredu, Vincent Mwintieru Nofong, Selasi Kwashie and Michael Bewong

15:45: Facial Landmark Detection: An Attentive Dropout-Based Occlusion-Adaptive Deep Network
Muhammad Sadiq, Liang Junwei, Geng Yu and Zhang Yunsheng

16:00: Learning Path Recommendation for MOOCs Using Sequential Patterns and Matching Similarity
Wei Song and Qihao Zhang

16:15: A New Closed High Utility Itemsets with Loose Restriction
Qinghua Zhang, Mu-En Wu, Chien-Ping Chung and Jimmy Ming-Tai Wu

16:30: Genetic Algorithm for Efficient Descriptive Pattern Mining
Muhammad Zohaib Nawaz, M. Saqib Nawaz and Philippe Fournier-Viger

16:45: Metamorphic Testing of High-Utility Itemset Mining
Tzung-Pei Hong, Rang Lee, Bay Vo and Shu-Min Lee

Special issue and best paper award

All the accepted papers are also invited to submit an extended version to a special issue of the Expert Systems journal. A best paper award will also be announced next week!


SPMF: bug fix about screen resolution

Hi all, this is just to let you know that I found a problem with the user interface of SPMF on low-resolution screens in update 2.60. The table for setting the parameters of algorithms was not appearing properly. I fixed the bug and replaced the files spmf.jar and spmf.zip on the website. If you had this issue, you may download the software again.

If you have any other issues with the new version of SPMF, please send me an e-mail to philfv AT qq DOT com.

In particular, if you try to compile the Java code, you may have to update your Java version to the latest version.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


SPMF 2.60 is released!

This is a short message today to announce that the new version of SPMF 2.60 is finally released!

This is a major version, as it contains many new things. The full list of changes can be found on the download page. Some of the main improvements are 18 new algorithms, 21 new tools to visualize different types of data, several improvements to the user interface (some less visible than others), and several new tools, such as a workflow editor for running more than one algorithm one after the other, and new tools for data generation and transformation. Here is a picture of a few of the new windows in the graphical user interface:

Besides, for developers of algorithms, a collection of new data structures optimized for primitive types (int, double, etc.) is provided in the package ca.pfv.spmf.datastructures.collections, which can replace several standard Java data structures to speed up algorithms or reduce memory usage. Here is a screenshot of some of those data structures:
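To illustrate why collections for primitive types can reduce memory usage, here is a minimal sketch of a growable list of ints (an illustration of the general idea only, not the actual classes shipped in ca.pfv.spmf.datastructures.collections). Unlike an ArrayList<Integer>, it stores raw int values and thus avoids boxing each value into an Integer object.

import java.util.Arrays;

/** Minimal sketch of a growable list of primitive ints (illustration only;
 *  not the actual implementation in ca.pfv.spmf.datastructures.collections).
 *  Storing raw ints avoids the memory overhead of boxing each value. */
public class IntList {
	private int[] data = new int[8];
	private int size = 0;

	public void add(int value) {
		if (size == data.length) {
			data = Arrays.copyOf(data, data.length * 2); // grow when full
		}
		data[size++] = value;
	}

	public int get(int index) {
		if (index >= size) {
			throw new IndexOutOfBoundsException("Index: " + index + ", size: " + size);
		}
		return data[index];
	}

	public int size() {
		return size;
	}
}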

I have also fixed several bugs in the software (thanks to all users who reported them). It is possible that some bugs remain, especially because there is a lot of new code. If you find any problems, please let me know at philfv AT qq DOT com. You can also let me know about your suggestions for improvements, if you have some ideas. 🙂 If you want to contribute code to SPMF, please contact me (for example, if you would like me to integrate your algorithm into the software).

Thanks again to all users of SPMF and the contributors, who support this project and make it better.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


How to download an offline copy of the SPMF documentation?

Today, I will show you how to download an offline copy of the SPMF documentation.

In the upcoming version 2.60 of SPMF, you can run this algorithm to open the developer tools window:

Then you can click here to open the tool to download an offline copy of the SPMF documentation:

This will open a window to start the download:

Then, you will have a local copy of the SPMF documentation on your computer, where the main page is documentation.html:

If you want to download a copy of the SPMF documentation directly using Java code, here is how it is done:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Copyright (c) 2022 Philippe Fournier-Viger
 *
 * This file is part of the SPMF DATA MINING SOFTWARE
 * (http://www.philippe-fournier-viger.com/spmf).
 *
 * SPMF is free software: you can redistribute it and/or modify it under the
 * terms of the GNU General Public License as published by the Free Software
 * Foundation, either version 3 of the License, or (at your option) any later
 * version.
 *
 * SPMF is distributed in the hope that it will be useful, but WITHOUT ANY
 * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
 * A PARTICULAR PURPOSE. See the GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along with
 * SPMF. If not, see <http://www.gnu.org/licenses/>.
 */
/**
 * This is a tool to download an offline copy of the SPMF documentation.
 * 
 * @author Philippe Fournier-Viger
 *
 */
public class AlgoSPMFDownloadDoc {

	/** The URLs that have been already downloaded */
	Set<String> alreadyDownloaded;

	/** Method to run this algorithm
	 */
	public void runAlgorithm() {
		alreadyDownloaded = new HashSet<String>();
		String mainUrl = "https://philippe-fournier-viger.com/spmf/index.php?link=documentation.php";
		String folderPath = "doc";
		createDirectory(folderPath);
		savePage(mainUrl, folderPath + "/documentation.html", mainUrl);

		BufferedReader br = null;
		try {
			// Download the main documentation page
			URL url = new URL(mainUrl);
			HttpURLConnection conn = (HttpURLConnection) url.openConnection();
			br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
			String inputLine;
			StringBuilder content = new StringBuilder();
			while ((inputLine = br.readLine()) != null) {
				content.append(inputLine);
				content.append(System.lineSeparator());
			}

			// Replace all .php references with .html in the content
			String updatedContent = content.toString().replaceAll("\\.php", ".html");

			// Save CSS files
			Pattern cssPattern = Pattern.compile("href=\"(.*?\\.css)\"");
			Matcher cssMatcher = cssPattern.matcher(updatedContent);
			while (cssMatcher.find()) {
				String cssLink = cssMatcher.group(1);
				savePage(cssLink, folderPath + "/" + cssLink.substring(cssLink.lastIndexOf('/') + 1), mainUrl);
			}

			// Save pages and images that start with "Example"
			Pattern examplePattern = Pattern.compile("<a href=\"([^\"]+)\">Example");
			Matcher exampleMatcher = examplePattern.matcher(updatedContent);
			while (exampleMatcher.find()) {
				String exampleLink = exampleMatcher.group(1);
				savePage(exampleLink, folderPath + "/" + exampleLink, mainUrl);

			}

		} catch (MalformedURLException e) {
			System.err.println("The URL provided is not valid: " + mainUrl);
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("An I/O error occurred while processing the URL: " + mainUrl);
			e.printStackTrace();
		} finally {
			if (br != null) {
				try {
					br.close();
				} catch (IOException e) {
					System.err.println("An error occurred while closing the BufferedReader.");
					e.printStackTrace();
				}
			}
		}
	}

	/**
	 * Method to create a folder
	 * @param folderPath the path
	 */
	private void createDirectory(String folderPath) {
		try {
			Files.createDirectories(Paths.get(folderPath));
		} catch (IOException e) {
			System.err.println("An error occurred while creating the directory: " + folderPath);
			e.printStackTrace();
		}
	}

	/**
	 * Method to save a webpage
	 * @param urlString the url
	 * @param filePath the filepath where it should be saved
	 * @param baseUri the base URI
	 */
	private void savePage(String urlString, String filePath, String baseUri) {
		if (alreadyDownloaded.contains(urlString)) {
			return;
		}
		alreadyDownloaded.add(urlString);

		BufferedReader reader = null;
		try {
			URL url;
			// Check if the URL is absolute or relative
			if (urlString.startsWith("http://") || urlString.startsWith("https://")) {
				url = new URL(urlString);
			} else {
				// Convert relative URL to absolute URL
				URI base = new URI(baseUri);
				url = base.resolve(urlString).toURL();
			}

			HttpURLConnection conn = (HttpURLConnection) url.openConnection();
			reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
			StringBuilder contentBuilder = new StringBuilder();
			String inputLine;
			while ((inputLine = reader.readLine()) != null) {
				contentBuilder.append(inputLine);
				contentBuilder.append(System.lineSeparator());
			}

			// Change the file extension from .php to .html
			filePath = filePath.replace(".php", ".html");

			// Update links in the content
			String content = contentBuilder.toString();
			content = content.replaceAll("href=\"([^\"]+).php\"", "href=\"$1.html\"");
			content = content.replaceAll("https://www.philippe-fournier-viger.com/spmf/index.php\\?link=documentation\\.html", "documentation.html");

	        // Find and save images
	        Pattern imgPattern = Pattern.compile("src=\"([^\"]+\\.(png|jpg))\"");
	        Matcher imgMatcher = imgPattern.matcher(content);
	        while (imgMatcher.find()) {
	            String imgLink = imgMatcher.group(1);
	            String imgName = imgLink.substring(imgLink.lastIndexOf('/') + 1);
	            saveImage(imgLink, "doc/" + imgName, baseUri);
	        }
	        
			// Save the updated content to file
			Files.write(Paths.get(filePath), content.getBytes(StandardCharsets.UTF_8));
		} catch (URISyntaxException e) {
			System.err.println("The URI provided is not valid: " + urlString);
			e.printStackTrace();
		} catch (MalformedURLException e) {
			System.err.println("A malformed URL has occurred for the URI: " + urlString);
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("An I/O error occurred while saving the page: " + urlString);
			e.printStackTrace();
		} finally {
			if (reader != null) {
				try {
					reader.close();
				} catch (IOException e) {
					System.err.println("An error occurred while closing the BufferedReader.");
					e.printStackTrace();
				}
			}
		}
	}
	
	/**
	 * Method to save an image
	 * @param urlString the url
	 * @param filePath the filepath where it should be saved
	 * @param baseUri the base URI
	 */
	private void saveImage(String urlString, String filePath, String baseUri) {
		if (alreadyDownloaded.contains(urlString)) {
			return;
		}
		alreadyDownloaded.add(urlString);
		
	    InputStream in = null;
	    try {
	        URL url;
	        // Check if the URL is absolute or relative
	        if (urlString.startsWith("http://") || urlString.startsWith("https://")) {
	            url = new URL(urlString);
	        } else {
	            // Convert relative URL to absolute URL
	            URI base = new URI(baseUri);
	            url = base.resolve(urlString).toURL();
	        }
	        
	        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
	        in = conn.getInputStream();
	        Files.copy(in, Paths.get(filePath), StandardCopyOption.REPLACE_EXISTING);
	    } catch (URISyntaxException e) {
	        System.err.println("The URI provided is not valid: " + urlString);
	        e.printStackTrace();
	    } catch (MalformedURLException e) {
	        System.err.println("A malformed URL has occurred for the URI: " + urlString);
	        e.printStackTrace();
	    } catch (IOException e) {
	        System.err.println("An I/O error occurred while saving the image: " + urlString);
	        e.printStackTrace();
	    } finally {
	        if (in != null) {
	            try {
	                in.close();
	            } catch (IOException e) {
	                System.err.println("An error occurred while closing the InputStream.");
	                e.printStackTrace();
	            }
	        }
	    }
	}

}

I hope that this blog post has been interesting. The new version 2.60 of SPMF will be released in the next few days.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Is EasyChair still good?

Today, I will talk about EasyChair, one of the oldest conference management systems used in academia, especially in computer science (it was founded in 2002). Like many other researchers, I have been a user of EasyChair for over a decade, especially as an author and also as a program committee member for various conferences. I used to think that EasyChair was relatively easy to use. But in recent years, it seems that EasyChair has become more business-oriented, and it is basically no longer free to use since 2022 (unless an event receives no more than 20 submissions, which is quite small). So, is EasyChair still good?

To find out, I recently decided to give EasyChair a try for organizing a small academic event. The user interface was quite familiar; however, the design itself has not changed much over the last decade (as I recall). I saw that EasyChair had a 20-submission limit for the free license but thought that I would still use it, as I did not expect to receive many submissions. But eventually, I faced a problem: I received 21 submissions. So EasyChair sent me two warning e-mails saying that if I did not upgrade within 7 days, “reviewer access to your conference can be restricted or disabled“.

What does this warning mean? Well, unfortunately, I did not read that e-mail in time because I am quite busy and my EasyChair account is linked to a secondary e-mail address. So I soon found out what it means: EasyChair not only disabled the reviewer access, but it also locked me out of the conference chair account and asked me for money to regain access:

This happened after I had set the paper decisions and sent the notifications to authors, and just before authors would submit their camera-ready papers, that is, at the most critical moment.

At that moment, EasyChair gave me the option of either paying a relatively large amount or forever losing access to the data about the event, such as the list of papers, the authors’ e-mail addresses and the reviews.

Luckily, I had saved the author e-mails and the list of accepted papers in an Excel file; otherwise, I would have had to pay a relatively large amount of money to recover them. And since my academic event is non-profit, I do not have money to pay for this.

That situation was very inconvenient and, honestly, it is one of the reasons why I will not use EasyChair anymore.

If we put this bad experience aside, let’s talk about other aspects of EasyChair, such as the user interface. While it is relatively easy to use, I think that it looks outdated:

In terms of features provided to conference managers, I see that EasyChair has been enhanced with many additional features over the years, but for small events, I don’t think that many of these features are essential.

And in recent years, some excellent alternatives to EasyChair have emerged, such as Microsoft CMT, which is free. As far as I can see, the major conferences in computer science have moved to Microsoft CMT. The latter has, in my opinion, a better user interface, is fast, and is easier to use. There are also several other conference management systems that are free to use, such as the excellent OpenConf, whose community edition (which includes many essential features) can be downloaded and installed on your own server.

Given that EasyChair now has a very small 20-submission limit for the free license and that it locks out an account if the number of submissions exceeds 20, I will not use EasyChair anymore for organizing events. I think that there are many better alternatives.

What do you think? Do you still use EasyChair? Do you like it? Do you think there are better conference management systems? Leave a comment below to share your opinion.


Some interesting statistics about SPMF

While I am preparing the next version (2.60) of the SPMF data mining software in Java, here are some interesting statistics about the project, which I have generated directly from the metadata provided by SPMF:

The number of algorithms implemented per person (based on metadata)
Note: this is generated automatically from the metadata of each algorithm in SPMF (using the class DescriptionOfAlgorithm). Some author names are spelled in multiple ways and may contain some errors. The full list of contributors to SPMF is displayed on the SPMF website.

  • Philippe Fournier-Viger (206)
  • Yang Peng (12)
  • Antonio Gomariz Penalver (9)
  • Jayakrushna Sahoo (6)
  • Jerry Chun-Wei Lin (5)
  • Lu Yang (5)
  • Chen YangMing (5)
  • Wei Song et al. (5)
  • Yangming Chen (5)
  • Wei Song (4)
  • Ting Li (4)
  • Azadeh Soltani (4)
  • Peng Yang and Philippe Fournier-Viger (4)
  • Nader Aryabarzan (4)
  • Vincent M. Nofong modified from Philippe Fournier-Viger (3)
  • Cheng-Wei Wu et al. (3)
  • Zhihong Deng (3)
  • Prashant Barhate (3)
  • Chaomin Huang et al. (3)
  • Jiaxuan Li (3)
  • Zhitian Li (3)
  • Antonio Gomariz Penalver & Philippe Fournier-Viger (3)
  • Yimin Zhang (2)
  • Chaomin Huang (2)
  • Nouioua et al. (2)
  • Ting Li et al. (2)
  • Philippe Fournier-Viger and Yuechun Li (2)
  • Song et al. (2)
  • Fournier-Viger et al. (2)
  • Saqib Nawaz et al. (2)
  • Chao Cheng and Philippe Fournier-Viger (2)
  • Zevin Shaul et al. (2)
  • Alan Souza (2)
  • Rathore et al. (2)
  • Bay Vo et al. (2)
  • Junya Li (2)
  • Ryan Benton and Blake Johns (2)
  • Siddharth Dawar et al. (2)
  • Yanjun Yang (2)
  • Siddhart Dawar et al. (2)
  • Huang et al. (1)
  • M. (1)
  • C.W. Wu et al. (1)
  • Philippe Fournier-Viger and Cheng-Wei Wu (1)
  • Sacha Servan-Schreiber (1)
  • Dhaval Patel (1)
  • jnfrancis (1)
  • Cheng-Wei. et al. (1)
  • Ganghuan He and Philippe Fournier-Viger (1)
  • Siddharth Dawar (1)
  • Improvements by Nouioua et al. (1)
  • Philippe Fournier-Viger and Chao Cheng (1)
  • Yang Peng et al. (1)
  • Salvemini E (1)
  • Java conversion by Xiang Li and Philippe Fournier-Viger (1)
  • Alex Peng et al. (1)
  • Hoang Thanh Lam (1)
  • Souleymane Zida (1)
  • F. (1)
  • Shifeng Ren (1)
  • Lanotte (1)
  • github: limuhangk (1)
  • Youxi Wu et al. (1)
  • Hazem El-Raffiee (1)
  • Jiakai Nan (1)
  • Ahmed El-Serafy (1)
  • Souleymane Zida and Philippe Fournier-Viger (1)
  • Feremans et al. (1)
  • Han J. (1)
  • Shi-Feng Ren (1)
  • Fumarola F (1)
  • Vikram Goyal (1)
  • P. F. (1)
  • Petijean et al. (1)
  • Srinivas Paturu (1)
  • Malerba D (1)
  • & Malerba (1)
  • Ashish Sureka (1)
  • Fumarola (1)
  • Ying Wang and Peng Yang and Philippe Fournier-Viger (1)
  • Sabarish Raghu (1)
  • Wu et al. (1)
  • D. (1)
  • Srikumar Krishnamoorty (1)
  • Siddharth Dawar et al (1)
  • Ceci (1)
  • Wu (1)

The number of algorithms per category

  • HIGH-UTILITY PATTERN MINING (83)
  • FREQUENT ITEMSET MINING (54)
  • SEQUENTIAL PATTERN MINING (48)
  • TOOLS – DATA VIEWERS (22)
  • TIME SERIES MINING (16)
  • ASSOCIATION RULE MINING (16)
  • TOOLS – DATA TRANSFORMATION (15)
  • PERIODIC PATTERN MINING (13)
  • EPISODE MINING (10)
  • EPISODE RULE MINING (10)
  • CLUSTERING (10)
  • SEQUENTIAL RULE MINING (10)
  • GRAPH PATTERN MINING (6)
  • TOOLS – DATA GENERATORS (5)
  • TOOLS – STATS CALCULATORS (4)
  • TOOLS – SPMF GUI (4)
  • TOOLS – RUN EXPERIMENTS (1)
  • PRIVACY-PRESERVING DATA MINING (1)

The number of algorithms per type

  • DATA_MINING (259)
  • DATA_PROCESSOR (30)
  • DATA_VIEWER (25)
  • DATA_GENERATOR (5)
  • OTHER_GUI_TOOL (4)
  • DATA_STATS_CALCULATOR (4)
  • EXPERIMENT_TOOL (1)

The number of algorithms for each input data type

  • Transaction database (194)
  • Simple transaction database (80)
  • Transaction database with utility values (77)
  • Sequence database (73)
  • Simple sequence database (48)
  • Transaction database with timestamps (17)
  • Time series database (16)
  • Sequence database with timestamps (9)
  • Database of double vectors (8)
  • Labeled graph database (6)
  • Graph database (6)
  • Transaction database with utility values and time (5)
  • Multi-dimensional sequence database with timestamps (4)
  • Text file (4)
  • Multi-dimensional sequence database (4)
  • Time interval sequence database (3)
  • Sequence database with utility values (3)
  • Transaction database with utility values and taxonomy (3)
  • Transaction database with shelf-time periods and utility values (3)
  • Transaction database with utility values (HUQI) (3)
  • Sequence database with cost and binary utility (3)
  • Simple time interval sequence database (3)
  • Frequent closed itemsets (3)
  • Sequence database with cost and numeric utility (2)
  • Transaction database with utility values skymine format (2)
  • Transaction database with profit information (2)
  • Uncertain transaction database (2)
  • ARFF file (2)
  • Transaction database with utility and cost values (2)
  • Sequence database with strings (2)
  • Dynamic Attributed Graph (2)
  • Simple sequence database with strings (2)
  • Sequence database with utility and probability values (2)
  • Cost sequence database (2)
  • Sequential patterns (1)
  • Set of text documents (1)
  • Sequence database in non SPMF format (1)
  • Clusters (1)
  • Sequence (1)
  • Single sequence (1)
  • Transaction database with utility values (MEMU) (1)
  • Transaction database in non SPMF format (1)

The number of algorithms for each output data type

  • High-utility patterns (91)
  • High-utility itemsets (60)
  • Frequent patterns (56)
  • Sequential patterns (51)
  • Frequent itemsets (37)
  • Frequent sequential patterns (30)
  • Database of instances (22)
  • Association rules (16)
  • Episodes (15)
  • Time series database (14)
  • Periodic patterns (13)
  • Transaction database (12)
  • Periodic frequent patterns (12)
  • Sequential rules (11)
  • Episode rules (10)
  • Frequent closed itemsets (9)
  • Frequent closed sequential patterns (8)
  • Simple transaction database (8)
  • Sequence database (8)
  • Closed itemsets (8)
  • Top-k High-utility itemsets (7)
  • Closed high-utility itemsets (7)
  • Simple sequence database (6)
  • Frequent Sequential patterns (6)
  • Clusters (6)
  • Closed patterns (6)
  • Frequent episodes (6)
  • Frequent sequential rules (5)
  • Rare itemsets (5)
  • Rare patterns (5)
  • High average-utility itemsets (5)
  • Frequent episode rules (5)
  • Skyline patterns (4)
  • Subgraphs (4)
  • Generator patterns (4)
  • High-Utility episodes (4)
  • Generator itemsets (4)
  • Cost-efficient Sequential patterns (3)
  • Skyline High-utility itemsets (3)
  • Frequent sequential generators (3)
  • Frequent subgraphs (3)
  • Frequent itemsets with multiple thresholds (3)
  • Local Periodic frequent itemsets (3)
  • Correlated patterns (3)
  • Quantitative high utility itemsets (3)
  • Maximal patterns (2)
  • Cross-Level High-utility itemsets (2)
  • Multi-dimensional frequent closed sequential patterns (2)
  • Maximal itemsets (2)
  • High-utility probability sequential patterns (2)
  • Frequent maximal sequential patterns (2)
  • Density-based clusters (2)
  • Periodic frequent itemsets common to multiple sequences (2)
  • On-shelf high-utility itemsets (2)
  • Multi-dimensional frequent closed sequential patterns with timestamps (2)
  • Top-k frequent sequential rules (2)
  • Frequent maximal itemsets (2)
  • Frequent closed and generator itemsets (2)
  • Closed association rules (2)
  • Frequent time interval sequential patterns (2)
  • Perfectly rare itemsets (2)
  • Sequence Database with timestamps (2)
  • Top-k frequent sequential patterns (2)
  • Top-k High-Utility episodes (2)
  • Transaction database with utility values (2)
  • Closed and generator patterns (2)
  • Periodic high-utility itemsets (2)
  • Minimal rare itemsets (2)
  • Trend patterns (2)
  • Correlated High-utility itemsets (2)
  • Generator high-utility itemsets (2)
  • Association rules with lift and multiple support thresholds (2)
  • Rare correlated itemsets common to multiple sequences (1)
  • Productive Periodic frequent itemsets (1)
  • Peak high-utility itemsets (1)
  • Indirect association rules (1)
  • Top-k non-redundant association rules (1)
  • Top-k class association rules (1)
  • Database of double vectors (1)
  • Uncertain frequent itemsets (1)
  • Non-redundant Periodic frequent itemsets (1)
  • Transaction database with utility values and time (1)
  • Top-k Stable Periodic frequent itemsets (1)
  • Rare correlated itemsets (1)
  • Frequent sequential rules with strings (1)
  • Top-k frequent episodes (1)
  • Minimal itemsets (1)
  • High-utility association rules (1)
  • Uncertain patterns (1)
  • Ordered frequent sequential rules (1)
  • Top-k association rules (1)
  • High-utility itemsets with length constraints (1)
  • High-utility generator itemsets (1)
  • Multi-dimensional frequent sequential patterns with timestamps (1)
  • Local high-utility itemsets (1)
  • High-utility sequential rules (1)
  • Periodic frequent itemsets (1)
  • Density-based cluster ordering of points (1)
  • Frequent sequential patterns with occurrences (1)
  • Frequent sequential patterns with timestamps (1)
  • Frequent closed sequential patterns with timestamps (1)
  • Minimal high-utility itemsets (1)
  • Significant Trend Sequences (1)
  • Frequent fuzzy itemsets (1)
  • Periodic rare patterns (1)
  • Top-k Frequent subgraphs (1)
  • Minimal patterns (1)
  • Stable Periodic frequent itemsets (1)
  • Maximal high-utility itemsets (1)
  • Self-Sufficient Itemsets (1)
  • Text clusters (1)
  • Multi-dimensional frequent sequential patterns (1)
  • Top-k frequent non-redundant sequential rules (1)
  • Cost transaction database (1)
  • Hierarchical clusters (1)
  • Top-k frequent sequential patterns with leverage (1)
  • Compressing sequential patterns (1)
  • Minimal non-redundant association rules (1)
  • Sporadic association rules (1)
  • High-utility rules (1)
  • Locally trending high-utility itemsets (1)
  • Skyline Frequent High-utility itemsets (1)
  • High-utility sequential patterns (1)
  • Progressive Frequent Sequential patterns (1)
  • Erasable itemsets (1)
  • Attribute Evolution Rules (1)
  • Generators of high-utility itemsets (1)
  • Multiple Frequent fuzzy itemsets (1)
  • Correlated itemsets (1)
  • Multi-Level High-utility itemsets (1)
  • Association rules with lift (1)
  • Top-k sequential patterns with quantile-based cohesion (1)
  • Erasable patterns (1)
  • Frequent generator itemsets (1)
  • Frequent high-utility itemsets (1)

Conclusion

I hope that this is interesting. 🙂 If you have any comments, please leave them below.
