Text Indexing for Long Patterns: Anchors are All you Need

Proc. VLDB Endow. Published: 2023-05-01. Pages: 2117-2131. DOI: 10.14778/3598581.3598586

Lorraine A. K. Ayad, G. Loukides, S. Pissis


Citations: 1

Abstract

In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form while simultaneously enabling fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, most (if not all) widely used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures when we have at hand a lower bound ℓ on the length of the queried patterns, which is arguably a reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
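The abstract does not define bd-anchors, so the sketch below relies on the definition given in the earlier work by Loukides and Pissis, as we understand it: the bd-anchor of a window of length `ell` is the starting position of its lexicographically smallest rotation (leftmost on ties), and the anchors of a text are collected over all of its length-`ell` windows. The function names `min_rotation_start` and `bd_anchors` are hypothetical, positions are 0-based, and the code is a deliberately naive quadratic baseline. It is not the average-case linear-time algorithm the paper contributes.

```python
def min_rotation_start(window: str) -> int:
    """Return the leftmost 0-based start of a lexicographically minimal rotation."""
    best = 0
    for j in range(1, len(window)):
        # Strict '<' keeps the smallest index when two rotations are equal.
        if window[j:] + window[:j] < window[best:] + window[:best]:
            best = j
    return best


def bd_anchors(text: str, ell: int) -> list[int]:
    """Naively collect the order-ell bd-anchors of text as sorted 0-based positions.

    Assumption: the bd-anchor of a window is the start of its minimal rotation,
    following the definition attributed to Loukides and Pissis.
    """
    if ell <= 0 or ell > len(text):
        raise ValueError("ell must satisfy 0 < ell <= len(text)")
    anchors = set()
    for i in range(len(text) - ell + 1):
        # Map the window-relative anchor back to a position in the text.
        anchors.add(i + min_rotation_start(text[i:i + ell]))
    return sorted(anchors)


if __name__ == "__main__":
    print(bd_anchors("aabaaabcbda", 5))
```

Under this definition, adjacent windows often share the same anchor, so the anchor set tends to be much smaller than the number of text positions, which is what makes a sparse index possible. Because the anchor depends only on the window's content, a pattern of length at least `ell` has its first window anchored at the same relative offset wherever the pattern occurs in the text. A query can therefore start its search from the indexed anchors rather than from every position.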


