{"title":"Explorer: efficient DNA coding by De Bruijn graph toward arbitrary local and global biochemical constraints.","authors":"Chang Dou, Yijie Yang, Fei Zhu, BingZhi Li, Yuping Duan","doi":"10.1093/bib/bbae363","DOIUrl":null,"url":null,"abstract":"<p><p>With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by ¿10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11285171/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbae363","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0
Abstract
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by ¿10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
随着数字数据的指数级增长,人们迫切需要创新的存储介质和技术。DNA 分子因其稳定性、存储容量和密度,为信息存储提供了一种前景广阔的解决方案。然而,DNA 存储也面临着诸多挑战,如复杂的生化限制和编码效率。本文介绍的 Explorer 是一种基于德布鲁因图的高效 DNA 编码算法,它充分利用了德布鲁因图描述局部序列的能力。Explorer 可在各种生化约束条件下进行编码,如同源多聚物、GC 含量和不想要的图案。本文还介绍了基于转换器架构的快速解码算法 Codeformer,以进一步提高解码效率。数值实验表明,与其他先进算法相比,Explorer 不仅能在各种生化约束条件下实现稳定的编码和解码,还能将编码效率和比特率提高 ¿10%。此外,Codeformer 还展示了高效解码大量 DNA 序列的能力。在不同的参数设置下,它的解码效率比传统算法高出两倍多。当 Codeformer 与 Reed-Solomon 码结合使用时,其解码精度超过 99%,是高速解码应用的理想选择。这些进展有望推动基于 DNA 的数据存储系统的发展,并促进对 DNA 作为新型信息存储介质的更广泛探索。
期刊介绍:
Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data.
The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.