Today, I attended the EITCE 2021 conference, which was held in Xiamen, China from the 22nd to the 24th of October 2021. The conference was held virtually, and I participated as an invited keynote speaker. In this blog post, I will talk about the conference.

About the conference

This is the 5th International Conference on Electronic Information Technology and Computer Engineering (EITCE 2021). It is a computer science conference that has been held in different Chinese cities such as Xiamen, Shanghai and Zhuhai.

The proceedings are published by ACM, and all papers are indexed by EI Compendex.

The conference was well organized, in part by a company called GSRA, together with professors from Jimei University and Shanghai University of Engineering Science.

The website of the EITCE conference is: http://eitce.org/

Schedule of the first day

On the first day, there were five keynote speeches followed by paper presentations.

The first keynote was by Prof. Sun-Yuan Kung from Princeton University (USA) and was about deep learning, in particular neural architecture search (NAS). He discussed approaches to search for a good neural network architecture, for example using reinforcement learning.

The second keynote speaker was Prof. Dong Xu, from the University of Missouri. His talk was about using graph neural networks for single-cell analysis. This talk was more about bioinformatics, which is a bit far from what I do, but it was interesting as an application of machine learning.

The third keynote speaker was Prof. Yuping Wang from Tulane University, USA. The talk was about interpretable multimodal deep learning for brain imaging and genomics data fusion. A highlight of this talk was the point that interpretability is a challenge, but it is also very important for research on neural networks applied to real applications.

Then, there was my keynote (Prof. Philippe Fournier-Viger) about discovering interesting patterns in data using pattern mining algorithms.

The fifth keynote was by Prof. Yulong Bai.

Conclusion

That was an interesting conference, and I was happy to participate in it.

— Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open-source data mining software.

PMDB 2022 aims at providing a place for researchers from the fields of machine learning, pattern mining and databases to present and exchange ideas about how to adapt and develop techniques to process and analyze big complex data.

The scope of PMDB 2022 encompasses many topics that revolve around database technology, machine learning, data mining and pattern mining. These topics include, but are not limited to:

Artificial intelligence, machine learning and pattern mining models for analyzing big complex data

Database engines for storing and querying big complex data

Distributed database systems

Data models and query languages

Distributed and parallel algorithms

Real-time processing of big data

Nature-inspired and metaheuristic algorithms

Multimedia data, spatial data, biomedical data, and text

Unstructured, semi-structured and heterogeneous data

Temporal data and streaming data

Graph data and multi-view data

Uncertain, fuzzy and approximate data

Visualization and evaluation of big complex data

Predictive models for big complex data

Privacy-preservation and security issues for big complex data

Explainable models for big complex data

Interactive data analysis

Open-source software and platforms

Applications in domains such as finance, healthcare, e-commerce, sport and social media

The deadline for submitting papers is the 30th of November 2021.

All accepted papers of PMDB 2022 will be published in Springer LNCS together with the other DASFAA workshops. This will ensure good indexing in DBLP and other databases.

Hope to see your papers!

— Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open-source data mining software.

Today, I will talk about something that I like very much, which is travelling. Being a researcher has allowed me to travel to numerous places around the world that I might not have discovered otherwise. In fact, I have attended many conferences around the world since I was a graduate student, visited many research labs, and given invited talks in many locations. Besides, I have also travelled to many places as a tourist.

In total, I have visited 25 countries. In China, I have visited the four territories that belong to China (mainland China, Taiwan, Hong Kong and Macau). Those 25 countries (including the 4 territories of China) are pictured below:

For more details, here are some of the cities that I have visited (the map was generated by TripAdvisor, which allows users to create a map of visited cities).

As can be seen, I particularly like Asia and Europe.

This is just a short blog post to talk a little about something different. I very much miss travelling internationally due to the pandemic. But in the meantime, I have been travelling to several places around China, which is also very interesting for me, as it is a big country with many different things to see. I am looking forward to travelling internationally again, as there are still so many places to see around the world. Do you also like to travel? You may let me know in the comment section below.

— Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open-source data mining software.

Today, I will talk a little bit about the recent improvements and future directions for the SPMF data mining library.

How SPMF started

SPMF is a software project that I started around 2008 when I was a Ph.D. student in Montreal, Canada. The short story of the software is as follows. I was taking a Ph.D. course on data mining at the University of Quebec at Montreal. For that course, I had to implement a few data mining algorithms as homework. I implemented some simple algorithms in Java, such as Apriori, and some code for discovering association rules. Then, I decided to clean the code and add more algorithms in my free time, including those developed for my Ph.D. research.

My idea was to make something in Java for the pattern mining community. In fact, most of the code that I could see online was written in C++. I wanted to change this so as to use my favorite language, Java. Besides, I wanted to share pattern mining code so that other researchers could save time by not having to implement the same algorithms again. This is why all the code is open source. Thus, it is around that time, in early 2009, that I created the website for SPMF and put the first version online.

That version was simple and the code was not very efficient. Over the years, the code has been optimized and more algorithms have been added, and luckily many researchers have joined this effort by providing code for many other algorithms, such that today there are over 200 algorithms, many not available in other software programs. Besides, many other researchers have reported bugs and provided feedback to improve the software, which has been very useful for making it very stable and mostly bug-free. It is thanks to all contributors and SPMF users that the software is what it is today. Thanks!

What is the future?

The SPMF software is still very active. In just the first eight months of 2021, about 20 algorithms have already been added. But there are many things to do to further improve the software:

I have been working on a plugin system that is not finished but will likely appear in a future version of SPMF when it is stable enough. It will make it possible to download plugins as JAR files from online repositories and integrate them with SPMF. I have a version that is almost working, but I want to make sure it is well tested before it is released.

I also want to integrate additional tools into SPMF to automatically run experiments, to make it more convenient for researchers who want to compare algorithms.

I will eventually redesign the user interface to give it more capabilities. The user interface has always been quite simple, as the focus of the software is to provide an extensive library of algorithms. But it is perhaps time to add more functionality, such as allowing the user to combine several algorithms into a pipeline for processing data, and to save that pipeline to a file.

Next step: SPMF 3.0

It has already been a few years since SPMF 2.0 was released. The next major version will be SPMF 3.0, and hopefully it will be released early in 2022.

For SPMF 3.0, I will also publish a new research paper about SPMF. For version 0.9, a paper on SPMF was published in the Journal of Machine Learning Research. For version 2.0, I published a paper at PKDD 2016. For version 3.0, I will also write a paper for another top journal or conference. The people who have contributed the most to SPMF in recent years will be invited to co-author that paper (as much as possible, given limitations on the number of authors).

For those who have noticed, the convention for numbering versions of SPMF has changed quite a lot over the years. At the beginning, I started at 0.49 and incremented the number by 0.01. But I did not want to reach version 1.0 too early, so I started adding letters, like 0.96b, 0.96c, … 0.96r, and then even numbers after that, like 0.96r2, 0.96r3, 0.96r4, to stay away from 1.0 longer. The last version before 1.0 was 0.99j. After that, I jumped to version 2.0 for the PKDD paper, and then continued with 2.01, 2.02, … 2.50. The next jump will be to 3.0 in the next few months.

Conclusion

In this blog post, I have talked a little bit about the early development and future directions of SPMF. I hope it has been interesting!

Thanks again to all contributors and users of SPMF for supporting the software through all these years. I really appreciate your support.

— Philippe Fournier-Viger is a distinguished professor of computer science and founder of the SPMF open-source data mining library, which offers over 200 algorithms for pattern mining.

In this blog post, I will talk about the DEXA 2021 and DAWAK 2021 conferences, which I attended from September 27 to 30, 2021. These two conferences are co-located and co-organized every year in a different European country. This year, they were held virtually due to the COVID-19 pandemic.

What are DEXA and DAWAK?

DEXA 2021 is the 32nd International Conference on Database and Expert Systems Applications. It is a conference oriented towards database technology and expert systems, but it also accepts data mining papers.

DAWAK 2021 is the 23rd International Conference on Big Data Analytics and Knowledge Discovery. Its focus is similar to DEXA but more oriented towards data mining and machine learning. Several years ago, the DAWAK conference was named “Data Warehousing and Knowledge Discovery” (hence the acronym DAWAK), but the name changed in recent years.

The proceedings of DEXA and DAWAK are both published by Springer in the LNAI series, which ensures good visibility and indexing in EI, DBLP and other popular publication databases. The DEXA conference is older and viewed as a better conference than DAWAK by some researchers (e.g., in China, DEXA is ranked higher than DAWAK by the China Computer Federation).

Personally, I enjoy the DEXA and DAWAK conferences. They are not very big, but the papers are overall of good quality. Also, there are often special journal issues associated with these conferences. I have attended these conferences several times before. My reports about previous editions can be found here: DEXA and DAWAK 2016, DEXA and DAWAK 2018, and DEXA and DAWAK 2019.

Acceptance rate

This year, 71 papers were submitted to DaWaK 2021. Twelve papers were accepted as full papers and 15 as short papers. This gives an acceptance rate of 16% for full papers and 35% for full and short papers combined.

The authors of the best DAWAK papers were invited to submit an extended version to a special issue of the Data & Knowledge Engineering (DKE) journal.

For DEXA, I did not see information about the number of submissions in the front matter of the Springer proceedings. Usually, this information is provided for conferences published by Springer. But this time, it is only said that “the number of submissions was similar to those of the past few years” and that “the acceptance rate this year was 27%”. To estimate the number of submissions, I counted about 67 papers in the proceedings. Thus, the number of submissions would be about 67 / 27 × 100 ≈ 248. This would be roughly a 25% increase from last year, since there were 197 submissions in 2020, 157 in 2019, and 160 in 2018.
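For those who want to reproduce the estimate, the arithmetic is simply:

```python
# Estimate DEXA 2021 submissions from the stated 27% acceptance rate
# and the ~67 papers counted in the proceedings.
accepted = 67
rate = 0.27
submissions = accepted / rate
print(round(submissions))  # ≈ 248

# Compare with the 197 submissions of 2020.
growth = (submissions - 197) / 197 * 100
print(round(growth))  # ≈ 26, i.e. roughly the 25% increase mentioned above
```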

Opening

On the first day, there was the opening session.

The program of the conference was presented, as well as the different organizers. It was said that this year there is a panel and five keynote speakers. Attendees were also asked to scan a QR code during the opening to indicate their location, which generated the following word cloud:

Paper presentations

The paper presentations were done online using the Zoom software. There were many interesting topics. Here is a screenshot of the first paper session on big data from DEXA 2021:

I presented my student's paper about episode rules. During that session, there were about a dozen attendees, and there were some interesting questions. Due to the schedule and the time difference, I was not able to attend all the paper presentations that I wished to attend, but I saw some interesting work.

Papers about pattern mining

This year again, there were several papers about pattern mining at DEXA and DAWAK. Since it is one of my research areas, I will report about these papers:

P. Revanth Rathan, P. Krishna Reddy, Anirban Mondal: Improving Billboard Advertising Revenue Using Transactional Modeling and Pattern Mining. 112-118

Yinqiao Li, Lizhen Wang, Peizhong Yang, Junyi Li: EHUCM: An Efficient Algorithm for Mining High Utility Co-location Patterns from Spatial Datasets with Feature-specific Utilities. 185-191

Xin Wang, Liang Tang, Yong Liu, Huayi Zhan, Xuanzhe Feng: Diversified Pattern Mining on Large Graphs. 171-184

So Nakamura, R. Uday Kiran, Likhitha Palla, Penugonda Ravikumar, Yutaka Watanobe, Minh-Son Dao, Koji Zettsu, Masashi Toyoda: Efficient Discovery of Partial Periodic-Frequent Patterns in Temporal Databases. 221-227

Amel Hidouri, Saïd Jabbour, Badran Raddaoui, Mouna Chebbah, Boutheina Ben Yaghlane: A Declarative Framework for Mining Top-k High Utility Itemsets. 250-256

Conclusion

Overall, it was a good conference. It is not very big, but it is well organized, with some good papers. I will certainly continue to send papers to these conferences in the following years, and hopefully I can attend in person next time. That would be much more interesting than a virtual conference, because one of the best parts of academic conferences is being able to meet people and talk face to face.

— Philippe Fournier-Viger is a distinguished professor of computer science and founder of the SPMF open-source data mining library, which offers over 200 algorithms for pattern mining.

Today, I will talk about pattern mining. I will explain a topic that is, in my opinion, very important but has been largely overlooked by the research community working on high utility itemset mining: integrating length constraints in high utility itemset mining. The goal is to find patterns that have a maximum size defined by the user (e.g., no more than two items).

Why do this? There are two very important reasons.

First, from a practical perspective, it is often unnecessary to find very long patterns. For example, let's say that we analyze shopping data and find that a high utility pattern is that people buy {mapleSyrup, pancake, orange, cheese, cereal} together and that this yields a high profit. This may sound like an interesting discovery, but from a business perspective it is not useful, as this pattern contains too many items. For example, it would not be easy for a business to do marketing that promotes buying five items together. This was confirmed in my discussions with a real-life business: I was told by someone working for a company that they are not interested in patterns with more than 2 or 3 items.

Second, finding very long patterns is inefficient due to the very large search space. There are generally too many possible combinations of items. If we add a constraint on the length of the patterns to be found, then we can save a huge amount of time and focus on the small patterns that are often more interesting for the user.

Based on these motivations, some algorithms like FHM+ and MinFHM have focused on finding small patterns that have a high utility, using two different approaches. In this blog post, I will give a brief introduction to the ideas behind those algorithms, which could be integrated into other pattern mining problems.

First, I will give a brief introduction to high utility itemset mining for those who are not familiar with this topic, and then I will explain the solutions for finding short patterns that are proposed in those algorithms.

High utility itemset mining

High utility itemset mining is a data mining task that aims at finding patterns in a database that have a high importance. The importance of a pattern is measured using a utility function. There can be many applications of high utility itemset mining, but the classical example is to find the sets of products purchased together by customers in a store that yield a high profit (utility). In that setting, the input is a transaction database, that is, a set of records (transactions) indicating the items that customers have bought at different times. For example, consider the following transaction database, which contains seven transactions called T1, T2, T3, … T7:

The second transaction (T2) indicates that a customer has bought 4 units of an item “b”, which stands for Bread, 3 units of an item “c”, which stands for Cake, 3 units of an item “d”, which stands for Dates, and 1 unit of an item “e”, which stands for Egg. Another transaction contains 1 unit of an item “a”, denoting Apple, 1 Cake and 1 unit of Dates. Besides that table, another table is provided, indicating the relative importance of each item. In this example, that table indicates the unit profit of each item (how much money is earned by the sale of one unit):

This table for example indicates that the sale of 1 Apple yields a 5$ profit, the sale of 1 bread yields 2$ profit, and so on.

To perform the task of high utility itemset mining, the user must set a threshold called the minimum utility threshold (minutil). The goal is to find all the itemsets (sets of items) that have a utility (profit) that is no less than that threshold. For example, if the user sets the threshold to minutil = 33$, there are four high utility itemsets:

The first itemset {b,d,e} means that customers buying Bread, Dates and Eggs together yield a total utility (profit) of 36$ in this database. It is a high utility itemset because 36$ is no less than minutil = 33$. But how do we calculate the utility of an itemset in a database? It is not very complicated. Let me show you, taking the itemset {b,d,e} as an example. These items are purchased together in transactions T1 and T2 of the database, which are highlighted below:

To calculate the utility of {b,d,e}, we multiply the quantities associated with b, d and e in T1 and T2 by their unit profits. This is done as follows:

In T1, we have: (5 x 2) + (3 x 2) + (1 x 3) = 19 $, because the customer bought 5 breads at 2 $ each, 3 dates at 2 $ each and 1 egg at 3 $.

In T2, we have: (4 x 2) + (3 x 2) + (1 x 3) = 17 $, because the customer bought 4 breads at 2 $ each, 3 dates at 2 $ each and 1 egg at 3 $.

Thus, the total profit of {b,d,e} for T1 and T2 is 19$ + 17 $ = 36 $.
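The calculation above can be written as a small function. This is an illustrative sketch, not SPMF code: the transactions below only include the quantities mentioned in the example (other items of T1, if any, are omitted), and the unit profits are those implied by the calculation.

```python
# Unit profits from the example: Apple 5$, Bread 2$, Dates 2$, Egg 3$
unit_profit = {"a": 5, "b": 2, "d": 2, "e": 3}

# Transactions as {item: quantity}; only quantities given in the text are shown.
database = {
    "T1": {"b": 5, "d": 3, "e": 1},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
}

def utility(itemset, database, unit_profit):
    """Sum, over the transactions containing all items of the itemset,
    of quantity x unit profit for each item of the itemset."""
    total = 0
    for transaction in database.values():
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total

print(utility({"b", "d", "e"}, database, unit_profit))  # 36, as computed above
```

Since 36 $ is no less than minutil = 33$, the function confirms that {b,d,e} is a high utility itemset.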

The problem of high utility itemset mining has been widely studied over the last two decades. Beyond the shopping example above, it can be applied to many other problems, as the letters a, b, c, d, e could represent, for example, webpages or words in a text. Many efficient algorithms have been designed for high utility itemset mining, such as IHUP, UP-Growth, HUI-Miner, FHM, EFIM, ULB-Miner and REX, to name a few. If you are interested in this topic, I wrote a survey that introduces the problem in more detail and is easy to understand for beginners in this field.

Finding the Minimal High Utility Itemsets with MinFHM

As I said in the introduction, a problem with high utility itemset mining is that many high utility itemsets are very long and thus not useful in practice. This leads to finding too many patterns and to very long runtimes.

The first solution to this problem was proposed in the MinFHM algorithm. It is to find the minimal high utility itemsets. A minimal high utility itemset is simply a high utility itemset such that none of its proper subsets is a high utility itemset. This definition allows focusing on the smallest sets of items that yield a high utility (e.g., profit in this example). For example, if we take the same database and minutil = 33$, there are only three minimal high utility itemsets:

The itemset {b,c,d,e} is not a minimal high utility itemset because it has subsets, such as {b,d,e}, that are high utility itemsets.

To find the minimal high utility itemsets, MinFHM is a modified version of the FHM algorithm. It relies on search space reduction techniques that are specially designed for finding minimal high utility itemsets. This leads not only to finding fewer patterns than FHM, but also to much faster runtimes. On some benchmark datasets, MinFHM was, for example, up to 800 times faster than FHM and could find up to 900,000 times fewer patterns.
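To make the definition concrete, here is a naive post-processing filter that extracts the minimal itemsets from a set of high utility itemsets. This is purely illustrative (MinFHM itself prunes during the search rather than filtering afterwards), and the input itemsets are just a hypothetical example matching the discussion above.

```python
from itertools import combinations

def minimal_itemsets(huis):
    """Keep only the itemsets having no proper subset in the collection."""
    hui_set = {frozenset(s) for s in huis}
    return [set(s) for s in hui_set
            if not any(frozenset(sub) in hui_set
                       for r in range(1, len(s))
                       for sub in combinations(s, r))]

# {b,c,d,e} is discarded because its subset {b,d,e} is also high utility.
print(minimal_itemsets([{"b", "d", "e"}, {"b", "c", "d", "e"}]))
```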

For researchers, something interesting about the problem of minimal high utility itemsets is that it has two properties that are somewhat special to this problem.

I will not talk too much about the details of this as my goal is just to give some introduction. For more details about MinFHM, you can see the paper, powerpoint, video presentation and source code, below:

Fournier-Viger, P., Lin, C.W., Wu, C.-W., Tseng, V. S., Faghihi, U. (2016). Mining Minimal High-Utility Itemsets. Proc. 27th International Conference on Database and Expert Systems Applications (DEXA 2016). Springer, LNCS, pp. 88-101. [ppt][source code]

DOI: 10.1007/978-3-319-44403-1_6

Finding High Utility Itemsets with a Length Constraint using FHM+

Now, let me talk about another solution for finding short high utility itemsets. This solution consists of simply adding a new parameter that sets a maximum length on the patterns to be found. For example, if we take the same example and set minutil = 33$ and the maximum length to 3, then the following three high utility itemsets are found:

In this example, the result is the same as the minimal high utility itemsets, but that is not always the case.

To find the high utility itemsets with a length constraint, a naive solution is to filter out the high utility itemsets that are too long in a post-processing step, after applying a traditional high utility itemset mining algorithm such as FHM. However, that would not be efficient. For this reason, I proposed the FHM+ algorithm in previous work. It is a modified version of FHM. The key idea is as follows. The FHM algorithm, just like other high utility itemset mining algorithms, uses upper bounds on the utility, such as the TWU and the remaining utility (which I will not explain here), to reduce the search space. These upper bounds are defined by assuming that all items of a transaction could be used to create high utility itemsets. But if we have a length constraint and know that, say, we don't want to find patterns with more than 3 items, then we can greatly tighten these upper bounds. This allows pruning a much larger part of the search space, and thus yields a much faster algorithm!
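To give a flavor of the idea (a simplified sketch, not the exact definitions from the FHM+ paper): with a maximum length L and a current itemset of size k, at most L − k more items can be appended, so a remaining-utility upper bound only needs to count the L − k largest item utilities remaining in a transaction, rather than all of them. The numbers below are hypothetical.

```python
def remaining_utility_bound(itemset_utility, remaining_utilities, max_length, k):
    """Upper bound on the utility of any extension of the current itemset.
    Without a length constraint, all remaining utilities are summed;
    with one, only the (max_length - k) largest ones need to be counted."""
    budget = max(0, max_length - k)
    return itemset_utility + sum(sorted(remaining_utilities, reverse=True)[:budget])

# Current itemset has utility 20; items that can still be appended in this
# transaction have utilities 9, 7, 4 and 2 (hypothetical numbers).
print(remaining_utility_bound(20, [9, 7, 4, 2], max_length=3, k=1))   # 20 + 9 + 7 = 36
print(remaining_utility_bound(20, [9, 7, 4, 2], max_length=10, k=1))  # 20 + 22 = 42
```

The tighter bound (36 instead of 42) can fall below minutil earlier, which is what allows pruning more of the search space.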

In the FHM+ paper, I showed that using these ideas, on benchmark datasets, memory usage can be reduced by up to 50%, speed can be increased by up to 4 times, and up to 2,700 times fewer patterns can be discovered!

This is just a brief introduction, and these ideas could be used in other pattern mining problems. For more details, you may see the paper, powerpoint presentation and code below:

In this blog post, I have explained why it is unnecessary, for some applications such as analyzing customer behavior, to find very long patterns in high utility itemset mining. I have also shown that if we focus on short patterns, we can greatly improve the runtimes and reduce the number of patterns shown to the user. This can bring high utility itemset mining algorithms closer to what users really need in real life. I have discussed two solutions for finding short patterns: finding minimal high utility itemsets and using a length constraint.

That is all for today!

— Philippe Fournier-Viger is a distinguished professor of computer science and founder of the SPMF open-source data mining library, which offers over 200 algorithms for pattern mining.

Today, I will talk about the design of my personal research webpage, which has evolved over the years from 2006 (the first year of my Ph.D.) until today (2021). It is around 2006 that I decided to buy a .com domain name to make a webpage. My goal at that time was to have a Web presence so that people could easily find the PDFs of my research papers and also read about my research. The design of my webpage has not changed much over the years, as you can see below:

On the top left is the first version of my webpage, with a white background. That webpage was HTML 4 compliant and had a few sections like “Main”, “Publications”, “Software” and “About me”. From 2006 to 2009, I made minor changes to the website, mainly to update my list of papers, change my picture (a few times) and add some other information. Then, around 2012, a student from Algeria, Hanane Amirat, kindly offered to redesign my website with a colored background, as can be seen on the top right, which made it look better. At that time, I was also starting to work as a professor and added more sections to my website, including a link to this blog. Then, around 2020, I redesigned the website again to make it suitable for mobile devices, as search engines started to take this into account. This version can be seen at the bottom left. The version from 2020 looks almost the same as the 2017 version, but under the hood I modified the website to use a responsive design template so that the menu can be dynamically resized on mobile devices.

Do you like the latest version of the website? If not, or if you have some suggestions to improve it, please leave a comment below 🙂 Maybe it is time to change the design again 🙂 In fact, I feel that the website colors are a little bit dark. Maybe it would be time to change to another design…

That is all I wanted to share today. If you are a researcher and do not have a website yet, I recommend making one, or at least having a page on websites such as ResearchGate and LinkedIn. This will bring more visibility to your research work!

— Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open-source data mining software.

There are two post-doc positions that are now OPEN in my research team in Shenzhen, China.

=== Topic ===

Data mining or machine learning algorithms for processing graphs, sequences, streams or other complex data; pattern mining

=== Benefits ===

Very good salary

Work in an excellent research team at a top university

2-year contract

=== Requirements ===

Have published papers in good journals or conferences as main author during your Ph.D. (papers in unknown journals from unknown or predatory publishers do not count)

Be less than 35 years old (a requirement of the university; I have no control over this)

Be very motivated and able to produce high-quality papers

If you are not in China already, you must be in a country where a visa can be obtained for China

=== How to apply? ===

Send me an e-mail with your CV to philfv AT szu.edu.cn. Make sure that your CV includes your age and the list of your publications.

Tell me when you would be ready to start, as well as any other relevant information

If you know someone who might be interested, please share! Thanks!

In this blog post, I will talk briefly about the ECML PKDD 2021 conference, which I attended virtually this week, from the 13th to the 17th of September 2021.

What is ECML PKDD?

ECML PKDD is a European conference about machine learning and data mining. This year is the 24th edition of that conference. PKDD is viewed as quite a good conference in the field of data mining.

It is not the first time that I have attended PKDD. I previously wrote a report about ECML PKDD 2020.

The PKDD 2021 program

Researchers could submit their papers to two main tracks: the research track and the applied data science track. These two tracks received 685 and 220 submissions, of which 146 (21%) and 64 (29%) papers were accepted, respectively. Thus, it is slightly easier to get accepted in the applied data science track.

Besides these two tracks, 40 papers were accepted as part of a journal track. The journal track is something special that not all conferences have. How does it work? Authors can submit a paper to either the Machine Learning journal or the Data Mining and Knowledge Discovery journal and, if it is accepted, also present the work at the PKDD conference.

The PKDD proceedings of regular papers are published by Springer in the LNCS series:

For the workshops, there were no official proceedings. Thus, several workshop organizers (including those of the MLiSE workshop that I co-organize) teamed up to prepare a joint workshop proceedings volume that will appear after the conference and will be published by Springer. Other workshops may have chosen to publish their proceedings in other ways.

An online conference

Due to the coronavirus pandemic, the conference was held online using a conference system called Whova. The website/app is quite convenient to use. It lets attendees see the schedule of the conference and watch recorded videos of the talks at a later time. There is also a function to search for attendees based on location and similar interests, which is interesting. It allowed me to find some other researchers from my city.

Opening ceremony

The opening ceremony gave more details about the conference.

For the research track, it was said that this year there were 384 program committee members and 68 area chairs to select the papers. Most papers received 3 reviews or more, with some having up to 5 reviews.

Some stats about the Research track:

The most popular keywords of the research track:

For the applied data science track, there were 233 program committee members, and each paper was reviewed by 3 to 5 reviewers. The key phrases from the applied data science track are:

Some other stats from the applied data science track:

Some slides about the journal track, including the number of accepted papers and submitted papers:

Some slides about the workshop topics:

Best paper award

Several awards were announced. Mainly, the best paper award went to:

Reparameterized Sampling for Generative Adversarial Networks, by Yifei Wang, Yisen Wang, Jiansheng Yang and Zhouchen Lin

Workshop on Machine Learning in Software Engineering (MLiSE 2021)

We had an excellent keynote talk by Prof. Zhi Jin from Peking University, who talked about using deep learning for software engineering. Some models were discussed for tasks such as code completion and code clone detection. A free tool called AIXCoder was also presented, which uses AI to support software developers.

There was also a second excellent keynote by Prof. Atif Mashkoor from Johannes Kepler University, Austria.

With two keynotes, seven papers, and many attendees, the MLiSE workshop was a success. We will thus try to organize it again at ECML PKDD next year!

Conclusion

That is all about the conference. I could have written more, but this week was very busy and I could not attend all the events.

— Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open-source data mining software.

Today, I will talk about pattern mining, a subfield of data science that aims at finding interesting patterns in data. More precisely, I will briefly describe a popular data mining task named high utility itemset mining. Then, I will describe a limitation of this task, which is the lack of information about quantities in the discovered patterns, and an extension that addresses this issue, called high utility quantitative itemset mining.

High Utility Itemset Mining

High utility itemset mining is a data mining task that consists of finding patterns that have a high utility (importance) in a dataset, where the utility is measured using a function. There are various applications of this task, but the most representative one is analyzing customer transactions to find sets of products that people like to purchase together and that yield a high profit. I will thus briefly explain what high utility itemset mining is for this application. In that context, the input is a transaction dataset, as illustrated in the table below.

This dataset contains seven records called T1, T2, T3, T4, T5, T6 and T7. Each record is a transaction, indicating what items (products) a customer has purchased in a store at a given time. In this example, the products are called “a”, “b”, “c”, “d” and “e”, which stand for apple, bread, cake, dates and eggs. The first transaction (T1) indicates that a customer purchased one unit of item “a” (one apple) with one unit of item “c” (one cake). The sixth transaction (T6) indicates that a customer bought two units of “a” (two apples) with six units of “c” (six cakes) and two units of “e” (two eggs). Other transactions follow the same format.

Moreover, another table is taken as input, shown below, which indicates the amount of money that is obtained by the sale of each item (the unit profit).

For instance, in that table, it is indicated that selling an apple (item “a”) yields a 5$ profit, while each bread sold (item “b”) yields a 2$ profit.
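To make the input format concrete, here is a minimal sketch of these two tables in Python. It only fills in the transactions and unit profits explicitly given in this post; the remaining transactions (T2, T5, T7) and unit profits would come from the full tables shown in the figures.

```python
# Partial example database (only the transactions described in the text).
# Each transaction maps an item to the purchased quantity.
transactions = {
    "T1": {"a": 1, "c": 1},          # one apple, one cake
    "T3": {"b": 5, "c": 1, "d": 3},  # five breads, one cake, three dates
    "T4": {"b": 4, "c": 3, "d": 3},
    "T6": {"a": 2, "c": 6, "e": 2},  # two apples, six cakes, two eggs
}

# Unit profit table ($ per unit sold); the profit of "e" is in the full table.
unit_profit = {"a": 5, "b": 2, "c": 1, "d": 2}

# Example: the total profit of transaction T3 is quantity x unit profit,
# summed over its items.
t3_profit = sum(q * unit_profit[i] for i, q in transactions["T3"].items())
print(t3_profit)  # (5 x 2) + (1 x 1) + (3 x 2) = 17
```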

The goal of high utility itemset mining is to find the itemsets (sets of items) that yield a utility (profit) that is greater than or equal to a threshold called the minimum utility (minutil) threshold, set by the user. For instance, if the user sets minutil = 30$, the goal is to find all the sets of items (itemsets) that yield at least 30$ in the above database. Those itemsets are called the high utility itemsets. For this example, there are only 8 high utility itemsets, which are shown below with their utility (profit) in the database.

For instance, the itemset {b,c,d} has a utility of 34$, and it is a high utility itemset because 34$ > minutil = 30 $. But how do we calculate the utility (profit) of an itemset? Let me explain this more clearly.

Let’s take the itemset {b,c,d} as example. To calculate the utility (profit) of {b,c,d}, we first need to find all the transactions that contain {b,c,d} together. There are only two transactions (T3 and T4), highlighted below:

After finding these transactions, we need to calculate the utility of {b,c,d} in transaction T3 and in transaction T4, and then do the sum.

Let’s first look at T3. Here, we have 5 units of “b”, which have a unit profit of 2$, with 1 unit of item “c”, which has a unit profit of 1$, and 3 units of “d”, which have a unit profit of 2$. Thus, the utility of {b,c,d} in T3 is (5 x 2) + (1 x 1) + (3 x 2) = 17 $.

Now, let’s look at T4. Here, we have 4 units of “b”, which have a unit profit of 2$, with 3 units of item “c”, which has a unit profit of 1$, and 3 units of “d”, which have a unit profit of 2$. Thus, the utility of {b,c,d} in T4 is (4 x 2) + (3 x 1) + (3 x 2) = 17 $.

Then, we sum the utility of {b,c,d} in T3 and T4 to get its total utility (profit) in the whole database, which is 17$ + 17$ = 34$.
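The calculation above can be sketched as a small Python function. This is an illustrative sketch, not code from an actual mining algorithm; the database below contains only the transactions given in the text, which does not change the result, since the omitted transactions do not contain {b,c,d}.

```python
# Partial example database from the text.
transactions = {
    "T1": {"a": 1, "c": 1},
    "T3": {"b": 5, "c": 1, "d": 3},
    "T4": {"b": 4, "c": 3, "d": 3},
    "T6": {"a": 2, "c": 6, "e": 2},
}
unit_profit = {"a": 5, "b": 2, "c": 1, "d": 2}  # $ per unit sold

def utility(itemset, transactions, unit_profit):
    """Utility of an itemset: over all transactions containing every item
    of the itemset, sum quantity x unit profit for the itemset's items."""
    total = 0
    for t in transactions.values():
        if all(item in t for item in itemset):  # transaction contains itemset?
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total

print(utility({"b", "c", "d"}, transactions, unit_profit))  # 17 + 17 = 34
```

With minutil = 30$, this itemset would be reported as a high utility itemset, since its utility of 34$ is at least 30$.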

To find the high utility itemsets in a database, many algorithms have been designed such as FHM, EFIM, HUI-Miner, ULB-Miner and UP-Growth. I have published a good survey on high utility itemset mining, which gives more details about it.

High Utility Quantitative Itemset Mining

The task of high utility itemset mining (HUIM) is useful, but it has a major limitation: the discovered high utility itemsets do not provide information about quantities. Thus, even though an itemset like {b,c,d} may be a high utility itemset, it does not indicate how many breads, cakes and dates people like to buy together. But of course, buying 1 bread, 5 breads or 10 breads is not the same. Thus, it is important to have this information.

To address this limitation, the task of High Utility Quantitative Itemset Mining (HUQIM) was proposed. In this task, the discovered patterns provide information about quantities.

I will explain this with an example. Consider this database:

In HUQIM, we may find an itemset with quantities such as {(apple:2), (cake:2)}, indicating that some people buy 2 apples with 2 cakes. But how do we calculate the utility of this itemset? The utility is calculated as before, but by only considering the transactions where a customer has bought 2 apples with 2 cakes, that is, T1 and T3:

The utility of {(apple:2), (cake:2)} in T1 is (2 x 3) + (2 x 2), and the utility of that itemset in T3 is (2 x 3) + (2 x 2). Thus, the utility of {(apple:2), (cake:2)} in the whole database is the sum of these: (2 x 3) + (2 x 2) + (2 x 3) + (2 x 2) = 20 $.

Such an itemset, where each item has an exact quantity, is called an exact Q-itemset.
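Here is a sketch of the exact Q-itemset utility calculation in Python. It uses only the values stated above (apple: 3$/unit, cake: 2$/unit, and the quantities in T1 and T3); the other transactions of the full example database are replaced by a placeholder that does not match the Q-itemset.

```python
# Second example database (partial): apple sells for 3$, cake for 2$.
unit_profit = {"apple": 3, "cake": 2}
transactions = {
    "T1": {"apple": 2, "cake": 2},
    "T2": {"apple": 1},              # placeholder: does not match the Q-itemset
    "T3": {"apple": 2, "cake": 2},
}

def utility_exact(qitemset, transactions, unit_profit):
    """Utility of an exact Q-itemset: only transactions where each item
    appears with exactly the required quantity are counted."""
    total = 0
    for t in transactions.values():
        if all(t.get(item) == qty for item, qty in qitemset.items()):
            total += sum(qty * unit_profit[item] for item, qty in qitemset.items())
    return total

print(utility_exact({"apple": 2, "cake": 2}, transactions, unit_profit))
# (2 x 3) + (2 x 2) + (2 x 3) + (2 x 2) = 20
```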

It is also possible to find another type of itemset, called a range Q-itemset. Explained simply, a range Q-itemset is an itemset where the quantities of items are expressed as intervals. For example, the itemset {(bread,4,5),(cake,2,3)} means that some customer(s) buy 4 to 5 breads with 2 to 3 cakes. Computing the utility of this itemset is also simple: we first find all transactions where there are 4 to 5 breads with 2 to 3 cakes, calculate the utility (profit) in each, and then do the sum. The two transactions meeting this criterion are T1 and T2:

Then, we multiply the quantities of bread and cake by their unit profits in these two transactions, which gives: (5 x 1) + (2 x 2) + (4 x 1) + (3 x 2) = 19 $.
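The range Q-itemset calculation can also be sketched in Python. I assume here, consistently with the computation above, that bread sells for 1$ and cake for 2$ per unit, and that T1 contains 5 breads with 2 cakes while T2 contains 4 breads with 3 cakes; the other transactions of the figure are replaced by a placeholder outside the ranges.

```python
unit_profit = {"bread": 1, "cake": 2}
transactions = {
    "T1": {"bread": 5, "cake": 2},
    "T2": {"bread": 4, "cake": 3},
    "T3": {"bread": 1, "cake": 1},  # placeholder: outside the ranges
}

def utility_range(qitemset, transactions, unit_profit):
    """Utility of a range Q-itemset: a transaction counts when each item's
    quantity falls within its [low, high] interval; the actual purchased
    quantities are then multiplied by the unit profits."""
    total = 0
    for t in transactions.values():
        if all(lo <= t.get(item, 0) <= hi for item, (lo, hi) in qitemset.items()):
            total += sum(t[item] * unit_profit[item] for item in qitemset)
    return total

print(utility_range({"bread": (4, 5), "cake": (2, 3)}, transactions, unit_profit))
# (5 x 1) + (2 x 2) + (4 x 1) + (3 x 2) = 19
```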

The problem of high utility quantitative itemset mining is interesting, as it provides more information to the user than high utility itemset mining. However, the problem is also much more difficult, because for each itemset like {b,c}, multiple quantities may be assigned, such as {(b,4),(c,3)}, {(b,5),(c,2)}, or even {(b,4,5),(c,3)}, etc. To avoid considering all possibilities, algorithms for high utility quantitative itemset mining use different combination methods to combine quantities into ranges. There are a few combine methods, such as “combine_all”, “combine_min” and “combine_max”, that can be used. I will not explain them in this blog post to avoid going into too many details. Besides, a parameter called “qrc” is also used to avoid making unnecessary combinations.

In the end, the goal of high utility quantitative itemset mining is to find all the itemsets with quantities that have a utility no less than the minutil threshold set by the user. Here is a brief example of input and output:

To find high utility quantitative itemsets, the main algorithms are:

HUQA (2007), proposed in the paper “Mining high utility quantitative association rules”

VHUQI (2014), presented in the paper “Vertical mining of high utility quantitative itemsets”.

HUQI-Miner (2019), introduced in the paper “Efficient mining of high utility quantitative itemsets”

If you are new to this topic, reading the FHUQI-Miner (2021) paper is a good start, as it is a journal paper that clearly explains all the important definitions with many examples. FHUQI-Miner was also shown in experiments to be much faster than the previous best algorithm, HUQI-Miner, so it is a good algorithm to start from when developing extensions. Below is, for example, a performance comparison of FHUQI-Miner with HUQI-Miner, where FHUQI-Miner was found to be up to 20 times faster in some cases.

Also, you may find open-source implementations of several algorithms for HUQIM in the SPMF open-source data mining software, with datasets.

It is also possible to study extensions of the problem of finding high utility quantitative itemsets. For example, here are two recent papers by my team where we extend FHUQI-Miner for variations of the problem:

TKQ (2021): Finding the top-k high utility quantitative itemsets. The idea is that rather than setting the minutil parameter, the user can directly ask for the k itemsets that yield the highest utility (e.g. the top 100 itemsets).

CHUQI-Miner (2021): This is a recent algorithm to find correlated high utility quantitative itemsets. The motivation is that many itemsets may have a high utility but still be weakly correlated. By using the bond correlation measure, more meaningful itemsets may be found.

Conclusion

In this blog post, I introduced the problem of high utility quantitative itemset mining. I hope it has been interesting. The content of this blog post is based on the articles on FHUQI-Miner that I wrote with Dr. Mourad Nouioua, a post-doc in my team, and on his very detailed PowerPoint presentation about this algorithm. If you have any comments, please leave them in the comment section below!

— Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open-source data mining software.