Computing the multi-string BWT and LCP array in external memory

Theoretical Computer Science (IF 1; Zone 4, Computer Science; JCR Q3, COMPUTER SCIENCE, THEORY & METHODS) Pub Date: 2021-03-16 DOI: 10.1016/j.tcs.2020.11.041
Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi
Theoretical Computer Science, volume 862, pages 42-58.

Abstract

Indexing very large collections of strings, such as those produced by the widespread next-generation sequencing technologies, relies heavily on the multi-string generalization of the Burrows–Wheeler Transform (BWT); the large memory requirements of in-memory approaches have stimulated recent developments in external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental in computing the suffix-prefix overlaps among strings, which is an essential step in many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP array, in which partial BWTs are merged with a forward approach to sorting suffixes.

In this paper, we propose an alternative backward strategy that yields an external memory method for simultaneously building the BWT and the LCP array of a collection of m strings of different lengths. For a set of strings of constant length k, the algorithm takes O(mkl) time and I/O volume and uses O(k+m) main memory, where l is the maximum value in the LCP array.
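
To make the two output structures concrete, the following is a minimal in-memory sketch that builds them directly from their definitions by sorting all suffixes. It is not the external memory algorithm of the paper and uses far more memory than O(k+m). The function name `naive_multistring_bwt_lcp` is hypothetical, and the conventions are assumptions: each string gets its own sentinel, sentinels sort before all letters and by string index, the symbol preceding a whole string is printed as `$`, and the first LCP entry is 0.

```python
def naive_multistring_bwt_lcp(strings):
    """Naive multi-string BWT and LCP array (illustration only)."""
    suffixes = []
    for i, s in enumerate(strings):
        # Letters become (1, code); the sentinel of string i becomes (0, i),
        # so sentinels are pairwise distinct and smaller than every letter.
        keyed = [(1, ord(c)) for c in s] + [(0, i)]
        for j in range(len(keyed)):
            # Symbol preceding the suffix; '$' when the suffix is the whole string.
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((keyed[j:], prev))

    # Lexicographic order of all suffixes of all strings.
    suffixes.sort(key=lambda t: t[0])

    bwt = "".join(prev for _, prev in suffixes)

    # LCP[0] = 0; LCP[i] = longest common prefix of suffixes i-1 and i.
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


bwt, lcp = naive_multistring_bwt_lcp(["ab", "b"])
print(bwt, lcp, max(lcp))  # bb$a$ [0, 0, 0, 0, 1] 1
```

For the collection {ab, b}, the sorted suffixes are $0, $1, ab$0, b$0, b$1, so the maximum LCP value l is 1. Because sentinels are distinct, a common prefix never extends across a string boundary.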
