Efficient genome sequence compression via the fusion of MDL-based heuristics
M. Zohaib Nawaz, M. Saqib Nawaz, Philippe Fournier-Viger, Shoaib Nawaz, Jerry Chun-Wei Lin, Vincent S. Tseng
Information Fusion, vol. 120, Article 103083. Publication date: 2025-08-01 (Epub 2025-03-17). DOI: 10.1016/j.inffus.2025.103083
Impact factor: 15.5. JCR: Q1 (Computer Science, Artificial Intelligence). Open access: no.
Article page: https://www.sciencedirect.com/science/article/pii/S1566253525001563
Citations: 0
Abstract
Developing novel methods for the efficient and lossless compression of genome sequences has become a pressing issue in bioinformatics due to the rapidly increasing volume of genomic data. Although recent reference-free genome compressors have shown potential, they often require substantial computational resources, lack interpretability, and fail to fully utilize the inherent sequential characteristics of genome sequences. To overcome these limitations, this paper presents HMG (Heuristic-driven MDL-based Genome sequence compressor), a novel compressor based on the Minimum Description Length (MDL) principle. HMG is designed to identify the set of k-mers (patterns) that yields the maximal compression of a dataset. By fusing two heuristic algorithms, the Genetic Algorithm and Simulated Annealing, with the MDL framework, HMG effectively navigates the extensive search space of k-mer patterns. An experimental comparison with state-of-the-art genome compressors shows that HMG is fast and achieves a low bits-per-base rate. Furthermore, the k-mers that HMG selects for compression are also employed for genome classification, thereby offering multifunctional advantages over previous genome compressors. HMG is available at https://github.com/MuhammadzohaibNawaz/HMG.
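The abstract does not give HMG's cost function, encoding scheme, or search operators, so the following is only a minimal sketch of the general idea of an MDL-guided k-mer search, not the authors' implementation. It assumes a toy two-part cost (codebook bits plus fixed-length token codes), a greedy left-to-right parse, and a plain Simulated Annealing loop; the Genetic Algorithm part is omitted. The names `description_length` and `anneal`, and all parameter values, are hypothetical.

```python
import math
import random


def description_length(seq, kmers):
    """Toy two-part MDL cost in bits: codebook cost plus encoded-data cost."""
    # Model cost (assumption): each codebook k-mer is stored at 2 bits per base.
    model_bits = sum(2 * len(k) for k in kmers)
    # Data cost (assumption): every token, a k-mer or one of the 4 single
    # bases, gets a fixed-length code over the whole symbol set.
    bits_per_token = math.ceil(math.log2(len(kmers) + 4))
    # Greedy left-to-right parse, trying longer k-mers first.
    ordered = sorted(kmers, key=len, reverse=True)
    tokens, i = 0, 0
    while i < len(seq):
        for k in ordered:
            if seq.startswith(k, i):
                i += len(k)
                break
        else:
            i += 1  # no k-mer matches here, so emit a single base
        tokens += 1
    return model_bits + tokens * bits_per_token


def anneal(seq, k_min=2, k_max=6, steps=2000, t0=5.0, cooling=0.995, seed=0):
    """Simulated Annealing over sets of k-mers, minimising description_length."""
    rng = random.Random(seed)
    current = set()
    cur_cost = description_length(seq, current)
    best, best_cost = set(current), cur_cost
    temp = t0
    for _ in range(steps):
        candidate = set(current)
        if candidate and rng.random() < 0.5:
            # Neighbour move 1: drop one k-mer from the codebook.
            candidate.remove(rng.choice(sorted(candidate)))
        else:
            # Neighbour move 2: add a k-mer sampled from the sequence itself.
            k = rng.randint(k_min, min(k_max, len(seq)))
            start = rng.randrange(len(seq) - k + 1)
            candidate.add(seq[start:start + k])
        cost = description_length(seq, candidate)
        delta = cost - cur_cost
        # Accept improvements always, and worse moves with Metropolis probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = candidate, cost
            if cur_cost < best_cost:
                best, best_cost = set(current), cur_cost
        temp *= cooling
    return best, best_cost


seq = "ACGTACGTTTGACGTACGTAACGTACGT" * 4
best, bits = anneal(seq)
print(sorted(best), "bits per base:", bits / len(seq))
```

With an empty codebook this toy cost is exactly 2 bits per base, so any k-mer set that the search keeps as its best has a cost no higher than that baseline. A real compressor would replace the fixed-length codes with an entropy coder and would also have to account for k-mer lengths in the codebook cost.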
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.