A new version of SPMF (v2.64, November 2025)!

Today, I am happy to announce that a new version of the SPMF data mining library has been released.

It introduces five new algorithms:

  • The Carpenter algorithm (Pan et al., ICDM 2003), which is specialized for mining closed itemsets in databases where transactions contain a large number of items but the number of transactions is relatively small. This is especially useful for biological data. The implementation is very efficient, with multiple optimizations such as transaction merging.
  • The Carpenter Max algorithm, a version of Carpenter that mines maximal itemsets by post-processing the output of Carpenter.
  • The HMG_GA algorithm for discovering compressing sequential patterns in sequences using a genetic algorithm (M. Z. Nawaz et al., 2025).
  • The HMG_SA algorithm for discovering compressing sequential patterns in sequences using simulated annealing.
  • The GRIMP algorithm for discovering compressing itemsets in a transaction database using a genetic algorithm (M. Z. Nawaz et al., 2025).

Besides, various other improvements have been made to the SPMF software, including a new tool in the GUI of SPMF called the Algorithm_Graph_Visualizer, which displays algorithms that are similar in terms of input, output, or categories as a graph (I discussed this in a recent blog post):

Moreover, I have improved various graphical interface components such as the MemoryViewer, HistogramViewer, GraphViewer, and SPMF Text Editor by fixing a few bugs and adding new features. I have also added a new mode in the TransactionDatabaseViewer to view transactions either as lists or as columns.

If everything goes well, I might release more algorithms before the end of this year or early next year. I have several algorithms waiting to be integrated into SPMF!

To all users, thanks again for using the software! And to all contributors: thanks for your help!

If you have any algorithm that you would like to integrate into SPMF, feel free to contact me by e-mail.

Posted in spmf, Uncategorized

GMP: A new algorithm for compressing protein sequences

Today, I am pleased to share that our team has published a new algorithm called GMP for the compression and analysis of protein sequences. The paper will appear next month at BIBM 2025. Here is the full reference:

Nawaz, M. Z., Nawaz, S., Fournier-Viger, P., Niu, X., Li, M. (2025). A Multipurpose Protein Compressor based on MDL and Genetic Algorithm. Proceedings of BIBM 2025.

And you can watch the video of the presentation on YouTube:
Presentation (18 minutes)

Abstract

The rapid expansion of protein sequence databases has created challenges for efficient storage, transmission, and analysis. Unlike genomic sequences with only four nucleotide bases, proteins are composed of twenty amino acids, making compression more complex. Existing specialized protein compressors, such as AC, AC2, and CPM-FCM, have achieved promising performance but still face limitations, including high computational cost, low adaptability, and limited biological interpretability. This paper introduces GMP (Genetic algorithm-based MDL Protein compressor), a novel protein compression framework that leverages the Minimum Description Length (MDL) principle with a genetic algorithm to discover optimal patterns of amino acid subsequences (kAA-mers). Experimental results demonstrate that GMP attains compression performance comparable to state-of-the-art methods while additionally supporting tasks such as classification and clustering—capabilities absent from traditional protein compressors. This makes GMP not only an efficient compression framework but also a biologically interpretable tool for protein sequence analysis. GMP is available at github.com/MuhammadzohaibNawaz/GMP.

Index Terms—Protein sequences, Compression, Genetic Algorithm, Minimum Description Length, kAA-mers
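To give a concrete feel for the MDL principle mentioned in the abstract, here is a toy sketch (my own Python code, not the GMP implementation; the sequence and the cost model are invented for illustration): replacing a frequent kAA-mer with a single dictionary symbol shortens the total description, and a genetic algorithm can then search for the set of kAA-mers that minimizes this cost.

```python
# Toy illustration of the MDL idea behind GMP (my own sketch, not the
# authors' code): the "description length" is the size of the encoded
# data plus the size of the dictionary of patterns.

def description_length(sequence, patterns):
    """Total cost in symbols: encoded data plus the dictionary entries."""
    encoded = sequence
    dictionary_cost = 0
    for i, pattern in enumerate(patterns):
        symbol = chr(0x2460 + i)  # placeholder symbol standing for the pattern
        encoded = encoded.replace(pattern, symbol)
        dictionary_cost += len(pattern) + 1  # the pattern plus its symbol
    return len(encoded) + dictionary_cost

protein = "MKVLAAGMKVLAAGMKVLAAGQW"  # toy amino-acid sequence
baseline = description_length(protein, [])
with_kmer = description_length(protein, ["MKVLAAG"])  # a frequent 7AA-mer
print(baseline, with_kmer)  # → 23 13
```

A frequent pattern pays for its dictionary entry many times over, so keeping it in the model lowers the description length; rare patterns do not, which is what makes the discovered kAA-mers meaningful.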

In summary

GMP was designed not only to compress protein sequences but also to provide insights into their structure through the discovery of meaningful subsequence patterns. By integrating MDL with a genetic algorithm, it strikes an effective balance between compression quality and interpretability. One of the unique strengths of GMP is that it can simultaneously serve multiple purposes: compression, classification, clustering, and pattern discovery—functions rarely combined in a single framework. Here is the main flowchart from the paper:

We will release the paper soon after it is published next month at BIBM 2025.

Posted in Uncategorized

A new tool for visualizing algorithms from SPMF

Today, I wrote some code to visualize the relationships between the algorithms offered in the SPMF open-source pattern mining library. Here is the graph of all algorithms in SPMF (excluding tools) that have the same input and output type:

We can see a few big clusters such as high utility itemset mining algorithms:

Frequent itemset mining algorithms:

Closed itemset mining algorithms:

Episode rule mining algorithms:

Frequent sequential pattern mining algorithms:

Sequential rule mining algorithms:

Frequent episode mining algorithms:

If we instead draw an edge between two algorithms when they have the same input file type (but not necessarily the same output), the graph is different:

Now, there is a huge cluster for itemset mining and association rule mining:

And there is a big cluster for sequence mining algorithms:

And a smaller cluster for episode mining:

and a big cluster for high utility pattern mining:

The tool that I used to draw these pictures is the AlgorithmGraphVisualizer, a new GUI tool that will be offered in the next version of SPMF. It allows visualizing algorithm relationships with different options and exporting the result to PNG. Here is the current GUI interface:
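The grouping criterion behind these graphs can be sketched in a few lines (my own toy code, not the actual tool; the algorithm names and type labels below are made up): two algorithms are connected when they share the same input and output type.

```python
# Toy sketch of the edge criterion: connect algorithms that have
# identical (input type, output type) pairs.
from itertools import combinations

algorithms = {  # hypothetical entries; SPMF offers many more algorithms
    "Apriori": ("transactions", "frequent itemsets"),
    "FPGrowth": ("transactions", "frequent itemsets"),
    "PrefixSpan": ("sequences", "sequential patterns"),
}

edges = [(a, b) for a, b in combinations(sorted(algorithms), 2)
         if algorithms[a] == algorithms[b]]
print(edges)  # → [('Apriori', 'FPGrowth')]
```

Relaxing the condition to compare only the input type yields the second, more connected graph shown above.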

Hope that this has been interesting! The new version of SPMF should probably come out next week!

Posted in Uncategorized

Fixing the reviewresponse.cls LaTeX Class to Allow Multi-Page Comments

Today, I will show how to fix the LaTeX reviewresponse.cls class to allow multi-page comments.

If you have ever written a detailed response to reviewers in LaTeX, you may have noticed that long reviewer comments sometimes get cut off instead of continuing on the next page. This happens because the comments are enclosed in non-breakable tcolorbox environments.

The Problem

In the original version of reviewresponse.cls, the environments for reviewer comments look something like this:

\newenvironment{generalcomment}{%
  \begin{tcolorbox}[attach title to upper,
    title={General Comments},
    after title={.\enskip},
    fonttitle={\bfseries},
    coltitle={colorcommentfg},
    colback={colorcommentbg},
    colframe={colorcommentframe},
  ]
}{\end{tcolorbox}}

\newenvironment{revcomment}[1][]{\refstepcounter{revcomment}
  \begin{tcolorbox}[adjusted title={Comment \arabic{revcomment}},
    fonttitle={\bfseries},
    colback={colorcommentbg},
    colframe={colorcommentframe},
    coltitle={colorcommentbg},
    #1
  ]
}{\end{tcolorbox}}

\newenvironment{changes}{\begin{tcolorbox}[colback={colorchangebg},
  colframe={colorchangeframe},enhanced jigsaw,]
}{\end{tcolorbox}}

These definitions produce nice colored boxes, but the problem is that tcolorbox by default does not break across pages. When your reviewer writes a long paragraph, LaTeX tries to keep the entire box on one page, which can result in missing text or strange layout issues.

The Solution

The fix is simple: you need to make the boxes breakable and enhanced. The tcolorbox package provides two key options for this:

  • breakable — allows the content to flow onto the next page.
  • enhanced jigsaw — ensures compatibility with decorations, titles, and other layout features when breaking boxes.

Here is the fixed version of the environments:

\newenvironment{generalcomment}{%
  \begin{tcolorbox}[
    enhanced jigsaw,
    breakable,
    attach title to upper,
    title={General Comments},
    after title={.\enskip},
    fonttitle={\bfseries},
    coltitle={colorcommentfg},
    colback={colorcommentbg},
    colframe={colorcommentframe},
  ]
}{\end{tcolorbox}}

\newenvironment{revcomment}[1][]{%
  \refstepcounter{revcomment}
  \begin{tcolorbox}[
    enhanced jigsaw,
    breakable,
    adjusted title={Comment \arabic{revcomment}},
    fonttitle={\bfseries},
    colback={colorcommentbg},
    colframe={colorcommentframe},
    coltitle={colorcommentbg},
    #1
  ]
}{\end{tcolorbox}}

\newenvironment{revresponse}[1][{}]{%
  \textbf{Response:} #1\par
}{\vspace{4em plus 0.2em minus 1.5em}}

\newenvironment{changes}{%
  \begin{tcolorbox}[
    enhanced jigsaw,
    breakable,
    colback={colorchangebg},
    colframe={colorchangeframe},
  ]
}{\end{tcolorbox}}

Result

After this modification, your reviewer comments and “changes” boxes will automatically continue onto the next page, no matter how long they are. You can now safely include large comments or detailed explanations without worrying about text being cut off.

Conclusion

By simply adding enhanced jigsaw and breakable to the tcolorbox environments, you make your LaTeX review responses much more robust. This small fix prevents truncated comments and keeps your document professional and reviewer-friendly.

Posted in Latex

How to fix reviewresponse.cls for custom reviewer numbering

Recently, I found a nice LaTeX class that can be used to write answers to reviewers for the rebuttal of journal papers. This LaTeX class is called reviewresponse.cls and can be found on GitHub. It allows writing an answer to reviewers with comments such as:

....

\reviewer

\begin{revcomment}
Figure 4 - please include legend to the right or below the main figure as in panel b legend overlaps with line of plot making confusion i interpretation. gentle grey grid in backround will also be valuable for plot investigation.
\end{revcomment}
\begin{revresponse}
    [your answer]
\end{revresponse}
\begin{changes}
    some changes you made
\end{changes}

\begin{revcomment}
    No avaliable implementation.
\end{revcomment}
\begin{revresponse}
     [your answer]
\end{revresponse}
\begin{changes}
    some changes you made
\end{changes}

which will then generate something beautiful like:

However, I have found a problem with this class: reviewers are automatically numbered as Reviewer 1, 2, 3, 4, 5… But in several cases, reviewers are not numbered sequentially and some numbers may be skipped.

To fix this issue, the solution is to redefine the \reviewer command in reviewresponse.cls as follows:

\newcommand*{\reviewer}[1][]{%
  \clearpage
  % If no optional argument, step the counter as before.
  \if\relax\detokenize{#1}\relax
    \refstepcounter{reviewer}%
  \else
    % If an argument was given, set reviewer to N-1 then refstep to N.
    % Using \numexpr avoids the off-by-one problem while keeping refstepcounter
    % (so labels/anchors behave correctly).
    \setcounter{reviewer}{\numexpr#1-1\relax}%
    \refstepcounter{reviewer}%
  \fi
  \@ifundefined{pdfbookmark}{}{%
    \pdfbookmark[1]{Reviewer \arabic{reviewer}}{hyperref@reviewer\arabic{reviewer}}%
  }%
  \section*{Authors' Response to Reviewer~\arabic{reviewer}}
}

After making this modification, the \reviewer command can be used in your LaTeX document with a parameter to specify the reviewer number that you want, like this: \reviewer[5]. The result then looks like this:


And now the problem is fixed.

That is all for today, I just wanted to share this solution in case someone has the same problem with reviewresponse.cls.

Posted in Latex

The Conference Hotel Booking Scam

Something interesting happened to me in the last few days. To my knowledge, this seems to be a scam, and something relatively new, so I want to share the information.

Here is the context. I will be a keynote speaker at a conference in Asia in a few months, and out of the blue, a company that appeared to be based in the Netherlands contacted me a few days ago by email offering to arrange my hotel accommodation. At first, the email from “ExploreEra Reservations” (reservation.nl@exploreera.info) looked very professional. They mentioned the conference location and month, and politely asked for my exact arrival and departure dates to reserve my hotel room. Their email was worded in the kind of tone you might expect from a real conference travel desk. Here is a screenshot:

But there were already some red flags in this e-mail, such as the indication that they require 30 days to cancel the reservation, which is highly unusual. In fact, most hotels allow a reservation to be cancelled up to 24 hours in advance without fees. But I still responded with basic details about my dates to see what they would say. In the follow-up email, there were more serious red flags. Here is a screenshot:

At about the same time, in a separate e-mail, they sent me a PandaDoc form for a hotel booking with a proposed rate of €200 per night, while also asking for personal information and a signature. There was also a strange disclaimer in small print indicating that they are not affiliated with the conference (very suspicious!), and there were HUGE cancellation fees:


Thus, I decided to investigate this. I Googled the proposed hotel name and found that its real rate is more like 20-50 euros per night on Booking DOT com, not 199 euros.

Then, I googled their organization — ExploreEra.info — and quickly discovered that at least two conferences have issued very serious warnings about emails from this domain approaching their attendees to book hotels on their behalf without authorization.

For example, the World Psychiatric Association (WPA) posted an alert noting that emails from ExploreEra.info have been contacting their delegates, pretending to arrange accommodation on behalf of the conference. Here is a screenshot of this warning:

Another event also issued a similar warning:


So, is this a scam? Well, in the emails I received, they never directly mentioned that they work for the conference, but the emails are worded in a way that gives this impression. Based on the above warnings from other conferences, the apparently inflated price, and the 30-day cancellation policy, it does indeed seem to be a scam. Thus, be warned!

By the way, there are several messages on Twitter warning about similar schemes, although I don't know if they come from the same people:

Posted in Academia

Huge traffic from a botnet looking for datasets

Today, I received an e-mail from the Web hosting company indicating that my website had exceeded the bandwidth limit of the content delivery network (CDN) for my package. I was quite surprised. Hence, I checked the control panel, and I saw a huge increase in bandwidth for the last three days, as shown below (in GBs).

By looking at the logs, I saw that bots from thousands of different addresses were trying to access datasets from the SPMF website using malformed URLs containing the word “datasets” multiple times. Here is an excerpt from the logs:

These URLs do not exist. However, due to the configuration of the server, they were redirected to the real dataset URLs, thus consuming a huge amount of bandwidth.

Since all the requests came from different IPs in dozens of different countries, it would not be realistic to ban all the IP addresses.

Thus, I checked how to fix the configuration. Finally, I modified the .htaccess file of the server to block malformed requests, and also deactivated the default fuzzy URL matching that the server uses to map non-existing paths to real paths. This may have caused some slight issues on the website during the last few hours. But now, I think the problem is fixed and the website will be faster!
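For readers facing a similar problem, here is a sketch of the kind of .htaccess rules I am referring to (the exact rules I used may differ; this is an illustrative Apache configuration, not a copy of my file): one directive disables Apache's fuzzy matching of non-existing paths to real files, and a rewrite rule rejects any URL containing the "datasets" segment more than once.

```apache
# Disable content negotiation that maps non-existing paths to real files
Options -MultiViews

# Reject malformed URLs where "datasets/" appears more than once
RewriteEngine On
RewriteCond %{REQUEST_URI} datasets/.*datasets/ [NC]
RewriteRule .* - [F]
```

The [F] flag makes the server answer such requests with a 403 Forbidden instead of serving (and paying the bandwidth for) the dataset file.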

So why was my website flooded by requests for datasets? I think the most likely reason is that someone launched a web-scraping botnet for data, and that the bot is buggy, such that it recursively adds /datasets/ to the same path dozens of times, as in this URL:

… /spmf/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/datasets/costtrans/datasets/onshelf/datasets/husp/datasets/husp/datasets/husp/BIBLE_sequence_utility.txt

Then, the botnet would not realize that it is actually downloading the same files over and over again from similar URLs…
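One plausible cause of such repeated segments (my own guess, sketched with Python's standard urljoin; the URL and file name are hypothetical) is a crawler that resolves the same relative link against each newly discovered page: every round of crawling then appends one more "datasets/" segment.

```python
# Sketch of how a buggy crawler can grow "/datasets/datasets/..." URLs:
# it keeps following the relative link "datasets/index.html" from pages
# that are already under /datasets/.
from urllib.parse import urljoin

url = "https://example.org/spmf/datasets/index.html"  # hypothetical start page
for _ in range(3):
    # the crawler (wrongly) follows the same relative link from each new URL
    url = urljoin(url, "datasets/index.html")
print(url)  # → https://example.org/spmf/datasets/datasets/datasets/datasets/index.html
```

Since each malformed URL is distinct, the crawler never detects that it is fetching the same content repeatedly.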

Update, a few hours later: I see that my new rules in .htaccess are working, as all invalid requests are now blocked:

Posted in Website

New version of SPMF: 2.63!

This is just a short blog post to let you know that SPMF 2.63 has been released on the SPMF website, with some new features such as the Visual Pattern Viewer (to display patterns visually), as well as some new algorithms and several bug fixes.

The list of changes in SPMF 2.63 can be found on the download page.

Just to give you a glimpse of this new version, here is a screenshot of the Visual Pattern Viewer, used here to view sequential patterns:

The Visual Pattern Viewer can be used to explore patterns visually with search and filter functions, and can display many different pattern types such as association rules, itemsets, and more.

And here is the list of new algorithms in SPMF 2.63:

  • The EMDO algorithm for mining frequent parallel episodes and episode rules in complex event sequences by counting distinct occurrences (thanks to Oualid Ouarem et al. for the original code).
  • The EMDO-Rules algorithm for generating episode rules from parallel episodes found by EMDO (Ouarem et al., 2024).
  • The RMiner algorithm for high utility itemset mining (thanks to Pushp Sra et al. for the original code).
  • The ScentedUtilityMiner algorithm for high utility itemset mining with a recency constraint using reinduction counters (thanks to Pushp Sra et al. for the original code).
  • The Density Peak Clustering (DPC) algorithm for clustering vectors of numbers.
  • The AEDBScan algorithm for clustering vectors of numbers.

What’s next?

The development of SPMF is always ongoing. If you want to contribute code for new algorithms, feel free to contact me with your code.

Hope you enjoy this new version of SPMF. The next version, 2.64, will bring more new algorithms and some further performance improvements. I expect it to be released this autumn.

Posted in open-source, spmf

Upcoming feature of SPMF 2.63: Taxonomy Viewer

Today, I want to briefly share a new feature of the upcoming SPMF version 2.63, which is the taxonomy viewer. This tool allows visualizing a taxonomy used by algorithms such as CLH-Miner and FEACP.

The user interface is for now quite simple and looks like this:

In this example, I display a file called taxonomy_CLHMiner.txt, which defines the taxonomy:
1,6
2,6
3,7
4,8
5,8
6,7
9,1
10,1

And a transaction database in a file transaction_CLHMiner.txt, which gives names to these items and lists the transactions:

@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=milk
@ITEM=4=bread
@ITEM=5=bagel
@ITEM=6=orange
@ITEM=7=FreshProducts
@ITEM=8=BreadProducts
@ITEM=9=red_apple
@ITEM=10=green_apple
1 3:6:5 1
5:3:3
1 2 3 4 5:25:5 10 1 6 3
2 3 4 5:20:8 3 6 3
1 3 4:8:5 1 2
1 3 5:22:10 6 6
2 3 5:9:4 2 3
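As a side note, the taxonomy file format is easy to process: each line "child,parent" links an item to its generalization. Here is a small sketch (my own code, not part of SPMF) that parses these pairs and lists the ancestors of an item:

```python
# Parse a taxonomy given as "child,parent" lines and walk up the hierarchy.

def read_taxonomy(lines):
    parent = {}
    for line in lines:
        child, par = line.strip().split(",")
        parent[int(child)] = int(par)
    return parent

def ancestors(item, parent):
    """Return the chain of generalizations of an item, bottom-up."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain

taxonomy = ["1,6", "2,6", "3,7", "4,8", "5,8", "6,7", "9,1", "10,1"]
parent = read_taxonomy(taxonomy)
print(ancestors(9, parent))  # → [1, 6, 7]
```

This is essentially what the taxonomy viewer displays as a tree.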

This is just to give you a preview of some new features of SPMF. The next version should be released in about 1 week!

Posted in open-source, Pattern Mining

CFP: The OCSA 2025 conference

Today, I would like to share the call for papers for the upcoming OCSA 2025 conference, which will be held in Changsha, the capital of the Hunan province of China. The details of the conference are as follows:

International Conference on Optoelectronics, Computer Science, and Algorithms (OCSA 2025)
Website: www.icocsa.net
Email: ocsa@163.com
Dates: September 19-21, 2025
Venue: Changsha, China
Indexing: EI database (pending approval)
Submission link: https://ocs.academiccenter.com/manager/dashboard

If you are interested, you may consider submitting a paper.

Posted in cfp