显著降低了自然DNA序列的熵估计

Proceedings DCC '97. Data Compression Conference Pub Date : 1997-03-25 DOI:10.1109/DCC.1997.581998

D. Loewenstern, P. Yianilos

{"title":"显著降低了自然DNA序列的熵估计","authors":"D. Loewenstern, P. Yianilos","doi":"10.1109/DCC.1997.581998","DOIUrl":null,"url":null,"abstract":"If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using expectation maximization (EM).","PeriodicalId":403990,"journal":{"name":"Proceedings DCC '97. Data Compression Conference","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"136","resultStr":"{\"title\":\"Significantly lower entropy estimates for natural DNA sequences\",\"authors\":\"D. Loewenstern, P. Yianilos\",\"doi\":\"10.1109/DCC.1997.581998\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using expectation maximization (EM).\",\"PeriodicalId\":403990,\"journal\":{\"name\":\"Proceedings DCC '97. Data Compression Conference\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1997-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"136\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC '97. Data Compression Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1997.581998\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '97. Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1997.581998","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 136

摘要

如果DNA是由字母表{a,C,G,T}组成的随机字符串，那么最佳编码将为每个核苷酸分配2位。我们想象DNA是一个高度有序的、有目的的分子，因此可以合理地期望它的字符串表示的统计模型产生更低的熵估计。令人惊讶的是，许多天然DNA序列，包括部分人类基因组，都不是这种情况。我们介绍了一个新的统计模型(压缩算法)，迄今为止最强的报道，自然发生的DNA序列。传统技术编码一个核苷酸所用的比特数(1.90)比仅依靠单个核苷酸的频率统计(1.95)所获得的比特数略少。我们的方法在某些情况下将这一差距增加了五倍以上(1.66)，并可能在微生物模式识别应用中带来更好的性能。我们的主要贡献之一，也是这些改进的主要来源，是在模型中正式包含了不精确匹配信息。在不同距离上存在的匹配形成一个专家小组，然后将其组合成一个单一的预测。该组合结构新颖，参数学习方法采用期望最大化方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Significantly lower entropy estimates for natural DNA sequences

If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using expectation maximization (EM).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings DCC '97. Data Compression Conference

自引率

0.00%

发文量

期刊最新文献

Robust image coding with perceptual-based scalability Image coding based on mixture modeling of wavelet coefficients and a fast estimation-quantization framework Region-based video coding with embedded zero-trees Progressive Ziv-Lempel encoding of synthetic images Compressing address trace data for cache simulations