Localized suffix array and its application to genome mapping problems for paired-end short reads.

Kouichi Kimura, Asako Koike
Genome informatics. International Conference on Genome Informatics. Journal Article. Published 2009-10-01.

Abstract

We introduce a new data structure, the localized suffix array, in which occurrence information for text search is dynamically represented as a combination of global positional information and local lexicographic order information. When searching for a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be represented compactly in terms of local lexicographic order, as in the conventional suffix array, and can be examined simultaneously for violation of the distance constraint at the coarse-grained resolution. The trade-off between positional and lexicographic information is then progressively shifted towards finer positional resolution, and the distance constraint is re-examined accordingly. The paired search can thus be performed efficiently even when each word has a large number of occurrences. The localized suffix array is in fact a reordering of the bits inside the conventional suffix array, so the memory requirements of the two are essentially the same. We demonstrate an application to genome mapping of paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, speedups of more than 10-fold over the naïve method were observed in almost half of the cases where the sum of the redundancies (numbers of individual occurrences) of the paired reads was greater than 2,000.
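To make the paired search problem concrete, the following Python sketch contrasts the naïve approach (look up every occurrence, sort the coordinates, compare them) with a single-level coarse-grained filter that groups positions by their high-order bits and discards block pairs that cannot satisfy the distance constraint. It is an illustration only and not the authors' implementation: the localized suffix array refines the resolution progressively and stores the groups in lexicographic form, neither of which is modelled here. All function names, the `block_bits` parameter, and the symmetric constraint `|a - b| <= max_dist` are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_suffix_array(text):
    # Toy construction by sorting suffixes directly; fine for short strings only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    # Binary search for the suffix-array interval whose suffixes start with `word`.
    # The returned positions are in lexicographic order, not positional order.
    k = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < word:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]


def naive_pairs(pos_a, pos_b, max_dist):
    # Naive baseline: sort all coordinates, then scan a window for each position.
    pos_b = sorted(pos_b)
    pairs = []
    for a in sorted(pos_a):
        i = bisect_left(pos_b, a - max_dist)
        j = bisect_right(pos_b, a + max_dist)
        pairs.extend((a, b) for b in pos_b[i:j])
    return pairs


def coarse_filtered_pairs(pos_a, pos_b, max_dist, block_bits):
    # Assumed simplification: one fixed coarse resolution of 2**block_bits positions.
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for a in pos_a:
        blocks_a[a >> block_bits].append(a)
    for b in pos_b:
        blocks_b[b >> block_bits].append(b)
    # Positions within max_dist lie in blocks whose indices differ by at most `reach`.
    reach = (max_dist >> block_bits) + 1
    pairs = []
    for block_a, group_a in blocks_a.items():
        for block_b in range(block_a - reach, block_a + reach + 1):
            group_b = blocks_b.get(block_b)
            if group_b is None:
                continue  # whole group rejected at the coarse resolution
            pairs.extend((a, b) for a in group_a for b in group_b
                         if abs(a - b) <= max_dist)
    return sorted(pairs)


if __name__ == "__main__":
    text = "ACGTTTACGTACGAAACGT"
    sa = build_suffix_array(text)
    pos_a = occurrences(text, sa, "ACG")
    pos_b = occurrences(text, sa, "CGT")
    print(naive_pairs(pos_a, pos_b, 6))
    print(coarse_filtered_pairs(pos_a, pos_b, 6, 2))
```

Both functions return the same set of pairs. The difference is where the work goes: the naïve version touches and sorts every coordinate, whereas the blocked version can skip an entire group of candidates with one comparison of block indices, which is the effect the abstract attributes to checking the constraint at coarse resolution first.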
