Localized suffix array and its application to genome mapping problems for paired-end short reads.

Genome informatics. International Conference on Genome Informatics Pub Date : 2009-10-01

Kouichi Kimura, Asako Koike



Abstract

We introduce a new data structure, the localized suffix array, in which occurrence information is dynamically represented as a combination of global positional information and local lexicographic order information in text search applications. When searching for a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in terms of local lexicographic orders, as in the conventional suffix array, and they can be examined simultaneously for violation of the distance constraint at the coarse-grained resolution. The trade-off between positional and lexicographic information is progressively shifted towards finer positional resolution, and the distance constraint is re-examined accordingly. Thus the paired search can be performed efficiently even if each word has a large number of occurrences. The localized suffix array itself is in fact a reordering of the bits inside the conventional suffix array, so the memory requirements of the two are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, speedups of more than 10-fold over the naïve method were observed in almost half of the cases where the sum of the redundancies (numbers of individual occurrences) of the paired reads was greater than 2,000.
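The abstract gives no implementation details, so the following is only a toy sketch of the general coarse-to-fine idea, not the authors' localized suffix array. It assumes a naively built conventional suffix array, groups occurrences into fixed-size position blocks, and discards block pairs that cannot satisfy the distance constraint before comparing individual coordinates. All function names and the `block_bits` parameter are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def build_suffix_array(text):
    """Naive construction by sorting suffixes; adequate only for toy inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Start positions of `word`, located by binary search on the suffix array."""
    n = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose n-character prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose n-character prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def naive_pairs(pos_a, pos_b, max_dist):
    """Baseline: compare every coordinate of one word with every coordinate of the other."""
    return [(a, b) for a in pos_a for b in pos_b if abs(a - b) <= max_dist]


def coarse_to_fine_pairs(pos_a, pos_b, max_dist, block_bits=4):
    """Reject whole blocks at coarse resolution, then check exact coordinates."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for p in pos_a:
        blocks_a[p >> block_bits].append(p)
    for p in pos_b:
        blocks_b[p >> block_bits].append(p)
    # Positions within max_dist differ by at most this many block indices.
    reach = (max_dist >> block_bits) + 1
    pairs = []
    for block, group_a in blocks_a.items():
        for other in range(block - reach, block + reach + 1):
            group_b = blocks_b.get(other)
            if group_b is None:
                continue  # no occurrences of the second word in this block
            pairs.extend((a, b) for a in group_a for b in group_b
                         if abs(a - b) <= max_dist)
    return sorted(pairs)


if __name__ == "__main__":
    text = "ACGTACGTTACG"
    sa = build_suffix_array(text)
    pos_a = occurrences(text, sa, "ACG")
    pos_b = occurrences(text, sa, "GT")
    print(coarse_to_fine_pairs(pos_a, pos_b, max_dist=3, block_bits=1))
```

In this sketch the block filter only prunes candidates, so it returns the same pairs as the all-against-all comparison. The structure described in the abstract goes further by storing the coarse position and the local lexicographic order together and refining the resolution progressively.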


