Efficient and low-complexity variable-to-variable length coding for DNA storage.

IF 2.9 | Zone 3, Biology | Q2, Biochemical Research Methods | BMC Bioinformatics | Pub Date: 2024-10-01 | DOI: 10.1186/s12859-024-05943-y
Yunfei Gao, Albert No
Citations: 0

Abstract

Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{\mathrm{GC}},\ 0.5 + c_{\mathrm{GC}}]$ (GC content constraint $c_{\mathrm{GC}}$). Sequencing or synthesis errors tend to increase when these constraints are violated.
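
To make the two constraints concrete, here is a minimal Python sketch that checks whether a given strand satisfies them. It is an illustration only and not the authors' code; the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) are hypothetical, and the small tolerance on the GC comparison is an assumption added to avoid floating-point edge cases at the boundary.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = current = 0
    previous = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_ratio(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (seq must be non-empty)."""
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_gc."""
    if not seq:
        return True  # assumption: an empty strand violates nothing
    run_ok = max_homopolymer_run(seq) <= h
    # Tolerance is an assumption of this sketch, to keep boundary ratios inside.
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return run_ok and gc_ok


if __name__ == "__main__":
    print(satisfies_constraints("ACGTACGT", h=4, c_gc=0.05))  # balanced, no long runs
    print(satisfies_constraints("AAAAACGT", h=4, c_gc=0.05))  # run of five A's
```

A strand such as `AAAATTTTAC` passes the homopolymer check for $h = 4$ but fails the GC check, since only one of its ten bases is G or C.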

Results: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{\mathrm{GC}} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.
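
One simple way to get a feel for rate limits of this kind is to count how many length-$n$ strands satisfy both constraints and take $\log_2(\text{count})/n$, which bounds the bits per base that any mapping into single valid length-$n$ strands can carry. The sketch below does this with a small dynamic program. It is an assumption-laden illustration: the block length `n`, the function names, and the fixed-length setting are hypothetical, and it is not claimed to be how the 1.990 figure in the abstract was derived.

```python
from collections import defaultdict
from math import log2


def count_valid(n: int, h: int, c_gc: float) -> int:
    """Count length-n strands over {A, C, G, T} (n >= 1, h >= 1) whose longest
    run is at most h and whose GC ratio lies within 0.5 +/- c_gc."""
    # State: (last base is G/C? 1 or 0, current run length, G/C count so far).
    # Each state value is the number of prefixes that reach it.
    states = {(0, 1, 0): 2, (1, 1, 1): 2}  # length-1 prefixes: {A, T} and {G, C}
    for _ in range(n - 1):
        nxt = defaultdict(int)
        for (cls, run, gc), ways in states.items():
            if run < h:
                # Repeat the same base: the run grows by one.
                nxt[(cls, run + 1, gc + cls)] += ways
            # Switch to the other base of the same class (A<->T or G<->C).
            nxt[(cls, 1, gc + cls)] += ways
            # Switch to either of the two bases of the other class.
            other = 1 - cls
            nxt[(other, 1, gc + other)] += 2 * ways
        states = nxt
    # Keep only strands whose final GC ratio is inside the allowed window
    # (small tolerance is an assumption of this sketch).
    return sum(ways for (_, _, gc), ways in states.items()
               if abs(gc / n - 0.5) <= c_gc + 1e-12)


def rate_bound(n: int, h: int, c_gc: float) -> float:
    """Bits per base available from valid length-n strands (0.0 if none exist)."""
    count = count_valid(n, h, c_gc)
    return log2(count) / n if count else 0.0


if __name__ == "__main__":
    for n in (20, 100, 200):  # hypothetical block lengths
        print(n, rate_bound(n, h=4, c_gc=0.05))
```

With no effective constraint (for example `h >= n` and `c_gc = 0.5`) the count is $4^n$ and the bound is exactly 2 bits per base, which is a quick sanity check on the recursion.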

Conclusion: We propose a variable-to-variable-length encoding method that achieves near-optimal rates without relying on the concatenation of short predefined sequences.

Source journal: BMC Bioinformatics (Biology, Biochemical Research Methods)
CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months
About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.