Genomic data is growing at an unprecedented rate, but storing and transmitting it efficiently remains a challenge. Several solutions have been proposed in the literature for compressing genome sequences in recent years such as JARVIS3, GeCo3, and NUHT. However, several of these methods face challenges such as high computational complexity, extended runtimes, limited generalization, interpretability issues, overfitting, and sensitivity to hyperparameters. In particular, deep-learning-based methods can have very long runtimes and operates as black-boxes
Thus, in this blog post, I want to introduce a new algorithm called HMG (2025) that we just published in the Information Fusion journal to mitigate these limitations of previous work:
Nawaz, M. Z., Nawaz, M. S., Fournier-Viger, P., Nawaz, S., Lin, J. C.-W., Tseng, V.S. (2025). Efficient Genome Sequence Compression via the Fusion of MDL-Based Heuristics. Information Fusion. Volume 120, 103083 DOI: https://doi.org/10.1016/j.inffus.2025.103083 |
The key idea of HMG is to use a pattern-mining approach, where patterns are extracted based on the MDL (Minimum Description Length) principle. More precisely, HMG tries to find the set of patterns (k-mers) in genome sequence that most succinctly describe them. Then, HMG uses these patterns to compress the genome sequences. Due to the very large search space of possible k-mers sets, a heuristic approach is used. More precisely, HMG consists of two algorithms called HMG-GA and HMG-SA, that respectively employ a genetic algorithm (GA) and simulated annealing (SA) to rapidly find a near optimal solution. Here is the flowchart of HMG (quite complex, but you may see the paper for details):
This novel approach for genome sequence compression has several advantages. In particular, it is very fast and achieves low bit-per-base compression.
In the paper, several experiments are presented on some benchmark datasets called D1, D2, D3 and D4 (see the paper for details). To give a glimpse about the results, the figure below from the paper show results for the bit-per-base (BPB) and compression ratios (CR) against several state-of-the-art genome sequence compressors:
In general, a lower BPB is better, and a higher CR is better. It can be seen in this figure that HMG-GA achieves very low BPB and comparable or high CR on all datasets.
But more importantly, HMG is very fast. Here is a figure comparing the compression and decompression times of various methods:
In this figure, the results are split into two sections: the compressors to the left of the blue vertical line are those that produced compression and decompression times for all four datasets. In contrast, the compressors to the right of the blue vertical line generated results for a subset of datasets (JARVIS3 and NUHT for DS1 only). It is found that HMG-GA(CC/SM) outperforms JARVIS2 and GeCo3 in both compression and decompression tasks. And among the methods that provide results for a subset of datasets, JARVIS3 emerges as the fastest, followed by NUHT.
Besides that, a very interesting advantage of the proposed HMG method is that it has multiple uses unlike several other genome sequence compressors. Because the set of patterns discovered by HMG are interpretable, they can also be used for the classification of genome sequences. In particular, in the HMG paper, we show different experiments about how the patterns that are generated for compression can be then reused to classify genome sequences. Here is one table for example that show classification accuracy using the patterns for various datasets (more details in the paper):

The source code and datasets of the HMG genome sequence compressor are public. The link for accessing them is https://github.com/MuhammadzohaibNawaz/HMG
Hope that this blog post has been interesting!