A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules.

IF 2.5 | Biology, Tier 3 | Q3 BIOTECHNOLOGY & APPLIED MICROBIOLOGY | Briefings in Functional Genomics | Pub Date: 2025-01-15 | DOI: 10.1093/bfgp/elae050

Subhankar Roy, Dilip Kumar Maity, Anirban Mukhopadhyay

{"title":"A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules.","authors":"Subhankar Roy, Dilip Kumar Maity, Anirban Mukhopadhyay","doi":"10.1093/bfgp/elae050","DOIUrl":null,"url":null,"abstract":"<p><p>Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing wide-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose Gzip and Zstd compressors. The absence of sequence-specific characteristics in these encoders results in subpar performance, and their use depends on time-consuming parameter adjustments. To address these limitations, in this article, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports various formats, including raw, FASTA, and multi-FASTA, commonly found in GenBank DNA and RNA files. We evaluate GraSS's performance using ten benchmark DNA sequences with reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the entire DNA sequence corpus has a total compression time (TCT) of 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. 
Contact: anirban@klyuniv.ac.in.</p>","PeriodicalId":55323,"journal":{"name":"Briefings in Functional Genomics","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735755/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in Functional Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bfgp/elae050","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}

Citations: 0

Abstract

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequence compressors for novel species frequently struggle when processing large-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose compressors Gzip and Zstd. Because these encoders do not exploit sequence-specific characteristics, they perform poorly on such data, and using them well depends on time-consuming parameter tuning. To address these limitations, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics observed in DNA and RNA sequences. It supports the raw, FASTA, and multi-FASTA formats commonly found in GenBank DNA and RNA files. We evaluate GraSS using ten benchmark DNA sequences with a reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the total compression time (TCT) for the entire DNA sequence corpus is 246.8 seconds (s). These results demonstrate that the proposed method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. Decompression time, memory usage, and CPU usage are also very competitive. Contact: anirban@klyuniv.ac.in.
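To make the reported metric concrete, the following is a minimal sketch of how a weighted average compression ratio might be computed, with Gzip as the general-purpose baseline. The abstract does not define WACR, so the definition used here (per-file ratios weighted by original file size) is an assumption, as are the function names and the toy sequences; none of this is the GraSS implementation.

```python
import gzip


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compression ratio as original size divided by compressed size."""
    return original_bytes / compressed_bytes


def weighted_average_compression_ratio(sizes):
    """Size-weighted mean of per-file compression ratios.

    `sizes` is a list of (original_bytes, compressed_bytes) pairs.
    ASSUMPTION: each file's ratio is weighted by its original size;
    the paper's exact WACR definition is not given in the abstract.
    """
    total_weight = sum(original for original, _ in sizes)
    weighted_sum = sum(
        original * compression_ratio(original, compressed)
        for original, compressed in sizes
    )
    return weighted_sum / total_weight


def gzip_sizes(sequence: str) -> tuple[int, int]:
    """Return (original_bytes, compressed_bytes) for a raw sequence under gzip."""
    raw = sequence.encode("ascii")
    return len(raw), len(gzip.compress(raw, compresslevel=9))


# Toy inputs, made up for illustration (not from the paper's benchmark).
toy_corpus = ["ACGT" * 1000, "AACCGGTT" * 250 + "ACGTTGCA" * 500]
sizes = [gzip_sizes(seq) for seq in toy_corpus]
print(weighted_average_compression_ratio(sizes))
```

Under this assumed definition, larger files dominate the average, so one highly repetitive long sequence can raise the figure considerably. This is one reason ratios for low-repeat DNA and highly repetitive RNA are usually reported separately.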




Source journal

Briefings in Functional Genomics (BIOTECHNOLOGY & APPLIED MICROBIOLOGY; GENETICS & HEREDITY)

CiteScore: 6.30

Self-citation rate: 2.50%

Publication volume:

Review time: 6-12 weeks

About the journal: Briefings in Functional Genomics publishes high-quality, peer-reviewed articles that focus on the use, development, or exploitation of genomic approaches and their application to all areas of biological research. As well as exploring thematic areas where these techniques and protocols are being used, articles review the impact that these approaches have had, or are likely to have, on their field. Subjects covered by the journal include, but are not restricted to: the identification and functional characterisation of coding and non-coding features in genomes, microarray technologies, gene expression profiling, next-generation sequencing, pharmacogenomics, phenomics, SNP technologies, transgenic systems, mutation screens, and genotyping. Articles range in scope and depth from the introductory level to specific details of protocols and analyses, encompassing bacterial, fungal, plant, animal, and human data. The editorial board welcomes the submission of review articles for publication. Essential criteria for publication are that papers do not contain primary data and that they are high-quality, clearly written review articles that provide a balanced, highly informative, and up-to-date perspective to researchers in the field of functional genomics.