Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

IF 1.5 Q4 ELECTROCHEMISTRY International Journal of Corrosion Pub Date : 2023-01-01 DOI:10.1137/1.9781611977554.ch187

Dominik Kempa, Tomasz Kociumaka

Citation count: 0

Abstract

The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with a plethora of applications in data compression, bioinformatics, and information retrieval. For a length-$n$ text, however, they use $\Theta(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-$n$ text over an alphabet of size $\sigma$, these structures use only $\mathcal{O}(n \log \sigma)$ bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they have almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in $\mathcal{O}(n)$ time [Hon et al., FOCS 2003], which is a $\Theta(\log_{\sigma} n)$ factor away from the lower bound of $\Omega(n / \log_{\sigma} n)$ (following from the necessity to read the input). In this paper, we make the first improvement in $n$ in 20 years for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit $o(n)$-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take $\mathcal{O}(n \log \sigma)$ bits, support SA queries and full suffix tree functionality in $\mathcal{O}(\log^{\epsilon} n)$ time per operation, and can be constructed in $\mathcal{O}(n \min(1, \log \sigma / \sqrt{\log n}))$ time using $\mathcal{O}(n \log \sigma)$ bits of working space. (For example, if $\sigma = 2$, the construction time is $\mathcal{O}(n / \sqrt{\log n}) = o(n)$.) We derive this result as a corollary of a much more general reduction: we prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.
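
To make the quantities in the abstract concrete, the following is a minimal Python sketch. It is not the construction proposed in the paper: the suffix array is built by naively sorting suffixes, which is far slower than any algorithm mentioned above. The function names (`suffix_array`, `plain_bits`, `packed_text_bits`, `construction_factor`) are hypothetical and chosen only for illustration, and the bit counts ignore the constant factors hidden in the asymptotic bounds.

```python
import math


def suffix_array(text: str) -> list[int]:
    """Starting positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting whole suffixes; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def plain_bits(n: int) -> int:
    """Bits used by a plain suffix array: n entries of ceil(log2 n) bits each."""
    return n * max(1, math.ceil(math.log2(n)))


def packed_text_bits(n: int, sigma: int) -> int:
    """Bits used by the text itself: n symbols of ceil(log2 sigma) bits each."""
    return n * max(1, math.ceil(math.log2(sigma)))


def construction_factor(n: int, sigma: int) -> float:
    """The factor min(1, log(sigma) / sqrt(log n)) from the stated time bound.

    Assumes n >= 2 and sigma >= 2 so that both logarithms are positive.
    """
    return min(1.0, math.log2(sigma) / math.sqrt(math.log2(n)))


if __name__ == "__main__":
    print(suffix_array("banana"))           # [5, 3, 1, 0, 4, 2]
    n = 2**20
    print(plain_bits(n))                    # n * 20 bits for the plain array
    print(packed_text_bits(n, 4))           # n * 2 bits for a 4-letter alphabet
    print(construction_factor(2**16, 2))    # 0.25, i.e. 1 / sqrt(16)
```

For `"banana"` the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, which gives the order shown. The two bit-count helpers illustrate the gap between $\Theta(n \log n)$ and $\mathcal{O}(n \log \sigma)$ bits, and `construction_factor` shows that the stated bound drops below linear exactly when $\log \sigma < \sqrt{\log n}$.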