Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

IF 1.5 Q4 ELECTROCHEMISTRY International Journal of Corrosion Pub Date : 2023-01-01 DOI:10.1137/1.9781611977554.ch187
Dominik Kempa, Tomasz Kociumaka

Abstract

The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with a plethora of applications in data compression, bioinformatics, and information retrieval. For a length-n text, however, they use Θ(n log n) bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-n text over an alphabet of size σ, these structures use only 𝒪(n log σ) bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they have almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in 𝒪(n) time [Hon et al., FOCS 2003], which is a Θ(log_σ n) factor away from the lower bound of Ω(n / log_σ n) (following from the necessity to read the input). In this paper, we make the first improvement in 20 years in the dependence on n for this problem by proposing a new compressed suffix array and a new compressed suffix tree that admit o(n)-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take 𝒪(n log σ) bits, support SA queries and full suffix tree functionality in 𝒪(log^ϵ n) time per operation, and can be constructed in 𝒪(n min(1, log σ / √(log n))) time using 𝒪(n log σ) bits of working space. (For example, if σ = 2, the construction time is 𝒪(n / √(log n)) = o(n).)
We derive this result as a corollary from a much more general reduction: we prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.
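
To make the quantities in the abstract concrete, the following Python sketch builds a suffix array by directly sorting suffixes and evaluates the stated bounds with all hidden constants set to 1. It is only an illustration: the function names are hypothetical, the naive sort is not the construction algorithm from the paper, and the bit counts ignore the lower-order terms that real compressed structures carry.

```python
import math


def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Sorting suffixes directly is far slower than the algorithms discussed in
    the abstract; it only shows what a suffix array contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def plain_bits(n: int) -> int:
    """Bits for a plain suffix array: n entries of ceil(log2 n) bits each."""
    return n * max(1, math.ceil(math.log2(n)))


def packed_bits(n: int, sigma: int) -> int:
    """Bits for n symbols packed with ceil(log2 sigma) bits each (the n log sigma scale)."""
    return n * max(1, math.ceil(math.log2(sigma)))


def construction_factor(n: int, sigma: int) -> float:
    """The factor min(1, log sigma / sqrt(log n)) from the stated construction time."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n is positive")
    return min(1.0, math.log2(sigma) / math.sqrt(math.log2(n)))


if __name__ == "__main__":
    print(naive_suffix_array("banana"))   # suffix start positions in sorted order
    n = 2 ** 20
    print(plain_bits(n), packed_bits(n, 4))
    print(construction_factor(2 ** 16, 2))
```

For a binary alphabet the factor shrinks as n grows, which is the o(n) behaviour the abstract highlights; for large alphabets the min(1, ...) term caps it at 1, so the bound falls back to linear time.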
