Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Journal of the ACM (JACM). Pub Date: 2018-09-08. DOI: 10.1145/3375890

T. Gagie, G. Navarro, N. Prezza


Citations: 129

Abstract

Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine word size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1-2 orders of magnitude in space and/or 2-3 in time.
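To make the measure r and the counting operation concrete, the following is a minimal Python sketch. It is not the index described in the article: it builds the BWT by sorting rotations (quadratic work), answers rank queries by scanning the string, and uses no compressed structures. The function names `bwt`, `count_runs`, and `count_occurrences` and the `$` sentinel are assumptions made for illustration only.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (illustration only).

    Assumes `sentinel` does not occur in `text` and sorts before every
    other character of the text.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal equal-letter runs; this is r when s is a BWT."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def count_occurrences(bwt_str, pattern):
    """Count pattern occurrences by backward search over the BWT.

    Rank queries are answered naively with str.count, so each step costs
    linear time; a real FM-index uses rank structures instead.
    """
    # C[c] = number of characters in the text strictly smaller than c.
    C = {}
    total = 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    lo, hi = 0, len(bwt_str)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt_str[:lo].count(c)
        hi = C[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    repetitive = "ab" * 8
    transformed = bwt(repetitive)
    print(transformed, count_runs(transformed))
    print(count_occurrences(bwt("abracadabra"), "abra"))
```

On the repetitive input `"ab" * 8`, the transform groups equal letters into a handful of runs even though the text has 16 characters, which is the effect that lets an O(r)-space index be much smaller than the text. The backward search only counts occurrences; reporting their positions within O(r) space is the problem the article addresses.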


