The Capacity of Secondary Structure Avoidance Codes for DNA Sequences
Authors: Chen Wang; Hui Chu; Gennian Ge; Yiwei Zhang
Journal: IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (impact factor 2.4; JCR Q2, Engineering, Electrical & Electronic)
Publication date: 2024-03-03
DOI: 10.1109/TMBMC.2024.3396404
URL: https://ieeexplore.ieee.org/document/10517954/
Citations: 0
Abstract
In DNA sequences, we have the celebrated Watson-Crick complement $\overline{T}=A$, $\overline{A}=T$, $\overline{C}=G$, and $\overline{G}=C$. The phenomenon of secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, which is usually caused by the existence of two non-overlapping reverse complement substrings. The property of secondary structure avoidance (SSA) forbids a sequence from containing such reverse complement substrings, and it is a key criterion in the design of single-stranded DNA sequences for both DNA storage and DNA computing. In this paper, we prove that the problem of constructing SSA sequences for any given secondary structure stem length $m$ can be characterized by a constrained system, and thus the capacity of SSA sequences can be calculated by the classic spectral radius approach in constrained coding theory. We analyze how to choose the generating set for the constrained system, which is a subset of vertices in a de Bruijn graph; this choice leads to some explicit constructions of SSA codes. In particular, our constructions have optimal rates of 1.1679 bits/nt and 1.5515 bits/nt when $m = 2$ and $m = 3$, respectively. In addition, we combine the SSA constraint with the homopolymer run-length-limit constraint and analyze the capacity of sequences satisfying both constraints.
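To make the two constraints named in the abstract concrete, the following is a minimal Python sketch, not the authors' construction. It assumes the SSA constraint for stem length $m$ means "no two non-overlapping substrings of length exactly $m$ that are reverse complements of each other"; the function names `reverse_complement`, `violates_ssa`, and `rll_capacity` are hypothetical. The second part illustrates the spectral radius approach only on the homopolymer run-length-limit constraint taken alone, whose transfer matrix is standard; it does not reproduce the paper's SSA constrained system or its rates.

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Watson-Crick reverse complement, e.g. ACG -> CGT."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def violates_ssa(seq, m):
    """True if seq has two non-overlapping length-m substrings that are
    reverse complements of each other (assumed reading of the constraint)."""
    first, last = {}, {}
    for i in range(len(seq) - m + 1):
        w = seq[i:i + m]
        first.setdefault(w, i)  # earliest start position of w
        last[w] = i             # latest start position of w
    # A violating pair exists iff some word's reverse complement starts
    # at least m positions after that word's earliest occurrence.
    return any(
        last.get(reverse_complement(w), -1) - i >= m
        for w, i in first.items()
    )


def rll_capacity(max_run, q=4, iters=200):
    """Capacity in bits/nt of q-ary sequences whose homopolymer runs have
    length at most max_run, estimated by power iteration on the transfer
    matrix (states = current run length 1..max_run)."""
    counts = [1.0] + [0.0] * (max_run - 1)
    growth = 0.0
    for _ in range(iters):
        total = sum(counts)
        # Either start a new run with one of the q-1 other symbols,
        # or extend the current run by one (if still within the limit).
        nxt = [(q - 1) * total] + counts[:-1]
        norm = sum(nxt)
        growth = norm / total          # converges to the spectral radius
        counts = [c / norm for c in nxt]
    return math.log2(growth)
```

For instance, `violates_ssa("ACGT", 2)` holds because `AC` and `GT` are reverse complements and do not overlap, while `rll_capacity(1)` equals $\log_2 3$ since each symbol after the first must differ from its predecessor.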
About the journal:
As a result of recent advances in MEMS/NEMS and systems biology, as well as the emergence of synthetic bacteria and lab/process-on-a-chip techniques, it is now possible to design chemical “circuits”, custom organisms, micro/nanoscale swarms of devices, and a host of other new systems. This success opens up a new frontier for interdisciplinary communications techniques using chemistry, biology, and other principles that have not been considered in the communications literature. The IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (T-MBMSC) is devoted to the principles, design, and analysis of communication systems that use physics beyond classical electromagnetism. This includes molecular, quantum, and other physical, chemical and biological techniques; as well as new communication techniques at small scales or across multiple scales (e.g., nano to micro to macro; note that strictly nanoscale systems, 1-100 nm, are outside the scope of this journal). Original research articles on one or more of the following topics are within scope: mathematical modeling, information/communication and network theoretic analysis, standardization and industrial applications, and analytical or experimental studies on communication processes or networks in biology. Contributions on related topics may also be considered for publication. Contributions from researchers outside the IEEE’s typical audience are encouraged.