The complexity of DNA sequences. Different approaches and definitions

V. Gusev, L. A. Miroshnichenko
Mathematical Biology and Bioinformatics, pp. 313-337. Published 2020-11-30. DOI: 10.17537/2020.15.313 (https://doi.org/10.17537/2020.15.313)

Abstract

An important quantitative characteristic of symbolic sequences (texts, strings) is their complexity, which reflects, at an intuitive level, the degree of their "non-randomness". A. N. Kolmogorov formulated the most general definition of complexity: he proposed measuring the complexity of an object (a symbolic sequence) by the length of the shortest description from which the object can be uniquely reconstructed. Since no program is guaranteed to find the shortest description, in practice various algorithmic approximations are used for this purpose; they are considered in this paper. Along with definitions of complexity that assume a sequence can be reconstructed from its "description", a number of measures are considered that do not imply such reconstruction and are instead based on calculating certain quantitative characteristics. Of interest is not only the quantitative assessment of complexity, but also the identification and classification of the structural regularities that determine its specific value. In one form or another, these regularities are manifestations of repetition in the broadest sense. The measures of complexity considered are conventionally divided into statistical ones, which take into account the frequency of occurrence of symbols or short "words" in the text; "dictionary" ones, which estimate the number of different "subwords"; and "structural" ones, which are based on identifying long repeated fragments of the text and determining the relationships between them. Most of the methods are designed for sequences of an arbitrary linguistic nature. The special attention paid to DNA sequences, reflected in the title of the article, is due to the importance of the object, the presence of repeats of different types, and the numerous examples of using the concept of complexity in solving problems of classification and evolution of various biological objects. Local structural features found in DNA sequences in sliding-window mode are of considerable interest, since zones of low complexity in the genomes of various organisms are often associated with the regulation of basic genetic processes.
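
The families of measures named in the abstract, and the sliding-window mode, can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the use of Shannon entropy over k-mers as the statistical measure, the fraction of distinct k-mers as the "dictionary" measure, and zlib's compressed size as a crude stand-in for the length of a shortest description are all assumptions made here for illustration. The "structural" measures based on long repeats are not covered.

```python
import math
import zlib
from collections import Counter


def kmer_entropy(seq: str, k: int = 1) -> float:
    """Statistical measure (assumed here): Shannon entropy, in bits,
    of the frequencies of overlapping k-mers in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def distinct_subword_ratio(seq: str, k: int = 3, alphabet_size: int = 4) -> float:
    """'Dictionary' measure (assumed here): number of distinct k-mers
    divided by the largest number that could occur in a sequence of this length."""
    n = len(seq) - k + 1
    if n <= 0:
        return 0.0
    observed = len({seq[i:i + k] for i in range(n)})
    return observed / min(n, alphabet_size ** k)


def compression_ratio(seq: str) -> float:
    """Crude approximation of description length: compressed size / raw size.
    A general-purpose compressor only gives an upper bound, not the true minimum."""
    raw = seq.encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def sliding_window(seq: str, measure, window: int = 50, step: int = 10):
    """Apply measure to each window of seq; return (start, value) pairs."""
    return [
        (start, measure(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: a mixed region followed by a poly-A run.
    dna = "ACGTTGCAAGCTTAGGCATC" + "A" * 20
    for start, value in sliding_window(dna, kmer_entropy, window=10, step=10):
        print(start, round(value, 3))
```

Under these assumptions, windows that fall entirely inside a homopolymer run score zero entropy, which is the kind of low-complexity zone the abstract refers to. The window and step sizes are arbitrary illustrative values.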
Source journal: Mathematical Biology and Bioinformatics (Mathematics: Applied Mathematics). CiteScore: 1.10. Self-citation rate: 0.00%. Articles published: 13.