Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi
{"title":"Computing the multi-string BWT and LCP array in external memory","authors":"Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi","doi":"10.1016/j.tcs.2020.11.041","DOIUrl":null,"url":null,"abstract":"<div><p>Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows–Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes.</p><p>In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of <em>m</em> strings of different lengths. The algorithm over a set of strings having constant length <em>k</em> has <span><math><mi>O</mi><mo>(</mo><mi>m</mi><mi>k</mi><mi>l</mi><mo>)</mo></math></span> time and I/O volume, using <span><math><mi>O</mi><mo>(</mo><mi>k</mi><mo>+</mo><mi>m</mi><mo>)</mo></math></span> main memory, where <em>l</em> is the maximum value in the LCP array.</p></div>","PeriodicalId":49438,"journal":{"name":"Theoretical Computer Science","volume":"862 ","pages":"Pages 42-58"},"PeriodicalIF":1.0000,"publicationDate":"2021-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1016/j.tcs.2020.11.041","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Theoretical Computer Science","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0304397520306885","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 4
Abstract
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows–Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes.
In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has time and I/O volume, using main memory, where l is the maximum value in the LCP array.
期刊介绍:
Theoretical Computer Science is mathematical and abstract in spirit, but it derives its motivation from practical and everyday computation. Its aim is to understand the nature of computation and, as a consequence of this understanding, provide more efficient methodologies. All papers introducing or studying mathematical, logic and formal concepts and methods are welcome, provided that their motivation is clearly drawn from the field of computing.