Indexing Compressed Text: A Tale of Time and Space (Invited Talk)

Bulletin of the Society of Sea Water Science, Japan Pub Date : 2020-01-01 DOI:10.4230/LIPIcs.SEA.2020.3

N. Prezza

Abstract

Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [11], dates back to 1973 and requires up to two orders of magnitude more space than the text itself. In the year 2000, two breakthrough works [6, 3] showed that this space overhead is not necessary: both the index and the text can be stored in space proportional to the text’s entropy. These contributions had an enormous impact on bioinformatics: nowadays, the two most widely used DNA aligners employ compressed indexes [9, 8]. In recent years, it became apparent that entropy had reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [7]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [7, 5, 1], together with generalizations of the indexing problem to labeled graphs [2, 10, 4]. This talk is a short and friendly survey of the landmarks of this fascinating path that took us from suffix trees to the most modern compressed indexes on labeled graphs.

2012 ACM Subject Classification: Theory of computation → Data compression; Theory of computation → Sorting and searching; Theory of computation → Pattern matching
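
As a rough illustration of why entropy misses repetitiveness, the toy Python sketch below compares the empirical entropy of a string with the number of equal-letter runs in its Burrows-Wheeler transform as the string is concatenated with itself. It is not taken from the talk: the function names h0, bwt, and runs and the sample string are made up for this example, and it assumes the simplest variant of each notion (zero-order entropy, and a BWT built from sorted cyclic rotations without a terminator symbol), which is not how a real index is constructed.

```python
import math
from collections import Counter
from itertools import groupby


def h0(text):
    """Zero-order empirical entropy of text, in bits per symbol."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def bwt(text):
    """Burrows-Wheeler transform via sorted cyclic rotations.

    Toy version (assumption): no terminator symbol, quadratic space,
    only suitable for short strings.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def runs(s):
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    unit = "GATTACA"  # hypothetical sample string
    for k in (1, 8, 64):
        t = unit * k  # k concatenated copies: a highly repetitive text
        print(f"copies={k:3d} length={len(t):4d} "
              f"H0={h0(t):.3f} bits/symbol BWT runs={runs(bwt(t))}")
```

Under these assumptions the entropy per symbol is the same for every number of copies, so an entropy-compressed size grows linearly with the text, while the number of BWT runs does not grow at all. That gap is what run-length BWT indexes exploit.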
