利用语法、统计和替换规则的无损无引用序列压缩算法。

IF 2.5 3区 生物学 Q3 BIOTECHNOLOGY & APPLIED MICROBIOLOGY Briefings in Functional Genomics Pub Date : 2025-01-15 DOI:10.1093/bfgp/elae050
Subhankar Roy, Dilip Kumar Maity, Anirban Mukhopadhyay
{"title":"利用语法、统计和替换规则的无损无引用序列压缩算法。","authors":"Subhankar Roy, Dilip Kumar Maity, Anirban Mukhopadhyay","doi":"10.1093/bfgp/elae050","DOIUrl":null,"url":null,"abstract":"<p><p>Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing wide-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose Gzip and Zstd compressors. The absence of sequence-specific characteristics in these encoders results in subpar performance, and their use depends on time-consuming parameter adjustments. To address these limitations, in this article, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports various formats, including raw, FASTA, and multi-FASTA, commonly found in GenBank DNA and RNA files. We evaluate GraSS's performance using ten benchmark DNA sequences with reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the entire DNA sequence corpus has a total compression time (TCT) of 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. Contact:  anirban@klyuniv.ac.in.</p>","PeriodicalId":55323,"journal":{"name":"Briefings in Functional Genomics","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735755/pdf/","citationCount":"0","resultStr":"{\"title\":\"A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules.\",\"authors\":\"Subhankar Roy, Dilip Kumar Maity, Anirban Mukhopadhyay\",\"doi\":\"10.1093/bfgp/elae050\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing wide-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose Gzip and Zstd compressors. The absence of sequence-specific characteristics in these encoders results in subpar performance, and their use depends on time-consuming parameter adjustments. To address these limitations, in this article, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports various formats, including raw, FASTA, and multi-FASTA, commonly found in GenBank DNA and RNA files. We evaluate GraSS's performance using ten benchmark DNA sequences with reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the entire DNA sequence corpus has a total compression time (TCT) of 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. Contact:  anirban@klyuniv.ac.in.</p>\",\"PeriodicalId\":55323,\"journal\":{\"name\":\"Briefings in Functional Genomics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735755/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Briefings in Functional Genomics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1093/bfgp/elae050\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"BIOTECHNOLOGY & APPLIED MICROBIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in Functional Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bfgp/elae050","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

用于新物种的脱氧核糖核酸(DNA)或核糖核酸(RNA)序列压缩器在处理大规模的原始、FASTA或多FASTA结构化数据时经常面临挑战。多年来,分子序列数据库一直青睐于广泛使用的通用Gzip和Zstd压缩器。在这些编码器中缺乏序列特定的特性导致性能低于标准,并且它们的使用依赖于耗时的参数调整。为了解决这些限制,在本文中,我们提出了一个无引用的无损序列压缩器,称为GraSS(基于语法、统计和替换规则)。GraSS通过利用DNA和RNA序列中的某些特征更有效地压缩序列。它支持各种格式,包括原始,FASTA和多FASTA,常见于GenBank DNA和RNA文件。我们使用10个重复次数减少的基准DNA序列、两个高度重复的RNA序列和15个原始DNA序列来评估GraSS的性能。测试结果表明,DNA和RNA序列的加权平均压缩比(WACR)分别为4.5和19.6。此外,整个DNA序列语料库的总压缩时间(TCT)为246.8秒(s)。这些结果表明,所提出的压缩方法比专门设计用于处理不同级别序列冗余的几种高级算法性能更好。解压时间、内存使用和CPU使用也非常有竞争力。联系:anirban@klyuniv.ac.in。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules.

Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing wide-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose Gzip and Zstd compressors. The absence of sequence-specific characteristics in these encoders results in subpar performance, and their use depends on time-consuming parameter adjustments. To address these limitations, in this article, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports various formats, including raw, FASTA, and multi-FASTA, commonly found in GenBank DNA and RNA files. We evaluate GraSS's performance using ten benchmark DNA sequences with reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the entire DNA sequence corpus has a total compression time (TCT) of 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. Contact:  anirban@klyuniv.ac.in.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Briefings in Functional Genomics
Briefings in Functional Genomics BIOTECHNOLOGY & APPLIED MICROBIOLOGY-GENETICS & HEREDITY
CiteScore
6.30
自引率
2.50%
发文量
37
审稿时长
6-12 weeks
期刊介绍: Briefings in Functional Genomics publishes high quality peer reviewed articles that focus on the use, development or exploitation of genomic approaches, and their application to all areas of biological research. As well as exploring thematic areas where these techniques and protocols are being used, articles review the impact that these approaches have had, or are likely to have, on their field. Subjects covered by the Journal include but are not restricted to: the identification and functional characterisation of coding and non-coding features in genomes, microarray technologies, gene expression profiling, next generation sequencing, pharmacogenomics, phenomics, SNP technologies, transgenic systems, mutation screens and genotyping. Articles range in scope and depth from the introductory level to specific details of protocols and analyses, encompassing bacterial, fungal, plant, animal and human data. The editorial board welcome the submission of review articles for publication. Essential criteria for the publication of papers is that they do not contain primary data, and that they are high quality, clearly written review articles which provide a balanced, highly informative and up to date perspective to researchers in the field of functional genomics.
期刊最新文献
Systematic analysis of the transcriptional landscape of melanoma reveals drug-target expression plasticity. Exploring the impact of N4-acetylcytidine modification in RNA on non-neoplastic disease: unveiling its role in pathogenesis and therapeutic opportunities. Crosstalk between genomic variants and DNA methylation in FLT3 mutant acute myeloid leukemia. Beyond the hype: using AI, big data, wearable devices, and the internet of things for high-throughput livestock phenotyping. Environmental community transcriptomics: strategies and struggles.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1