GMP: A new algorithm for compressing protein sequences

Today, I am pleased to share that our team has published a new algorithm called GMP for the compression and analysis of protein sequences. The paper will appear next month at BIBM 2025. Here is the full reference:

Nawaz, M. Z., Nawaz, S., Fournier-Viger, P., Niu, X., Li, M. (2025). A Multipurpose Protein Compressor based on MDL and Genetic Algorithm. Proceedings of BIBM 2025.

You can also watch the video of the presentation on YouTube:
Presentation (18 minutes)

Abstract

The rapid expansion of protein sequence databases has created challenges for efficient storage, transmission, and analysis. Unlike genomic sequences with only four nucleotide bases, proteins are composed of twenty amino acids, making compression more complex. Existing specialized protein compressors, such as AC, AC2, and CPM-FCM, have achieved promising performance but still face limitations, including high computational cost, low adaptability, and limited biological interpretability. This paper introduces GMP (Genetic algorithm-based MDL Protein compressor), a novel protein compression framework that leverages the Minimum Description Length (MDL) principle with a genetic algorithm to discover optimal patterns of amino acid subsequences (kAA-mers). Experimental results demonstrate that GMP attains compression performance comparable to state-of-the-art methods while additionally supporting tasks such as classification and clustering—capabilities absent from traditional protein compressors. This makes GMP not only an efficient compression framework but also a biologically interpretable tool for protein sequence analysis. GMP is available at github.com/MuhammadzohaibNawaz/GMP.

Index Terms—Protein sequences, Compression, Genetic Algorithm, Minimum Description Length, kAA-mers
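
To give a rough feel for how MDL and a genetic algorithm can work together, here is a minimal Python sketch. It is not the GMP implementation from the paper. Everything in it is an assumption made for illustration: the function names, the flat log2 cost model, the greedy left-to-right parsing, the fixed k, and the simple truncation selection. The idea it shows is only the general one: a set of kAA-mers is scored by the bits needed to store the patterns plus the bits needed to store the sequence rewritten with them, and a genetic algorithm searches for the set with the lowest score.

```python
import math
import random
from collections import Counter

BITS_PER_AA = math.log2(20)  # assumed flat cost of one amino acid literal


def candidate_kmers(seq, k, top):
    """Return up to `top` of the most frequent length-k substrings that repeat."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [kmer for kmer, c in counts.most_common(top) if c > 1]


def tokenize(seq, patterns):
    """Greedy parse: emit a pattern where one matches, otherwise one residue."""
    tokens, i = [], 0
    while i < len(seq):
        match = next((p for p in patterns if seq.startswith(p, i)), None)
        tokens.append(match if match else seq[i])
        i += len(match) if match else 1
    return tokens


def description_length(seq, patterns):
    """Two-part MDL cost in bits: L(model) + L(data | model)."""
    model_bits = sum(len(p) * BITS_PER_AA for p in patterns)
    tokens = tokenize(seq, patterns)
    # Each token is charged log2 of the alphabet size (20 residues + patterns).
    data_bits = len(tokens) * math.log2(20 + len(patterns))
    return model_bits + data_bits


def evolve(seq, candidates, generations=40, pop_size=20,
           mutation_rate=0.1, seed=0):
    """Search for a low-cost subset of `candidates` with a tiny genetic algorithm."""
    rng = random.Random(seed)
    n = len(candidates)

    def cost(mask):
        chosen = [c for c, bit in zip(candidates, mask) if bit]
        return description_length(seq, chosen)

    # Individuals are bit masks over the candidates; the all-zero mask
    # (no patterns at all) is included so the result never beats itself worse.
    population = [[0] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[:pop_size // 2]  # survivors are kept (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(0, n)  # one-point crossover
            child = [bit ^ (rng.random() < mutation_rate)  # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        population = parents + children

    best = min(population, key=cost)
    return [c for c, bit in zip(candidates, best) if bit], cost(best)


# Toy sequence with an obvious repeat (made up, not real data).
seq = "MKTAYIAKQR" * 6 + "GSHMLEDPVA" + "MKTAYIAKQR" * 4
candidates = candidate_kmers(seq, k=4, top=8)
baseline_bits = description_length(seq, [])
patterns, best_bits = evolve(seq, candidates)
print(f"no patterns: {baseline_bits:.1f} bits")
print(f"with {patterns}: {best_bits:.1f} bits")
```

Because the selected patterns are explicit subsequences rather than opaque model parameters, they can be inspected directly, which is the kind of interpretability the abstract refers to. A real compressor would also need an actual entropy coder and a decoder; this sketch only estimates sizes.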

In summary

GMP was designed not only to compress protein sequences but also to provide insights into their structure through the discovery of meaningful subsequence patterns. By integrating MDL with a genetic algorithm, it strikes an effective balance between compression quality and interpretability. One of the unique strengths of GMP is that it can simultaneously serve multiple purposes: compression, classification, clustering, and pattern discovery—functions rarely combined in a single framework. Here is the main flowchart from the paper:

We will release the paper soon after it is published next month at BIBM 2025.
