At the roots of dictionary compression: string attractors

Dominik Kempa, N. Prezza
{"title":"At the roots of dictionary compression: string attractors","authors":"Dominik Kempa, N. Prezza","doi":"10.1145/3188745.3188814","DOIUrl":null,"url":null,"abstract":"A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text’s size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text’s substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem — deciding whether a text has a size-t set of positions capturing all substrings of length at most k — is NP-complete for k≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.","PeriodicalId":20593,"journal":{"name":"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2017-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"95","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3188745.3188814","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 95

Abstract

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text’s size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text’s substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem — deciding whether a text has a size-t set of positions capturing all substrings of length at most k — is NP-complete for k≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
字典压缩的根源:字符串吸引子
在无损文本压缩领域中一个众所周知的事实是,当输入包含长重复时,高阶熵是一个弱模型。在这一事实的推动下,几十年的研究已经产生了无数所谓的字典压缩器:能够通过利用文本的重复来减少文本大小的算法。Lempel-Ziv 77是这类工具中最成功和最著名的工具之一,紧随其后的是直线规划、运行长度Burrows-Wheeler变换、宏观方案、拼贴系统和紧致有向无环字图。在本文中,我们展示了这些技术是同一个优雅的组合问题的不同解决方案:找到捕获所有不同文本子字符串的一小组位置。我们称这样的集合为弦吸引子。我们首先展示了字典压缩器和字符串吸引器之间的约简。这给出了字典压缩器相对于最小字符串吸引子的近似比率,并允许我们揭示不同字典压缩器输出大小之间的新渐近关系。然后我们证明了k吸引子问题——决定一个文本是否有一个大小为t的位置集合,捕获了长度不超过k的所有子串——对于k≥3是np完全的。特别地,这包括了全弦吸引子问题。我们提供了几种最小k吸引子的逼近方法,证明了对于常数k问题是apx完备的,并给出了强的不可逼近性结果。最后,我们给出了字符串吸引子随机存取问题的匹配下界和上界。通过展示在最优时间内支持查询的数据结构来证明上界。我们的数据结构是通用的:通过对字符串吸引子的简化,它支持对任何字典压缩方案的随机访问。特别是,它还匹配LZ77、直线程序、拼贴系统和宏方案的下界,因此基本上(立即)解决了所有这些压缩器的随机访问问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Data-dependent hashing via nonlinear spectral gaps Interactive compression to external information The query complexity of graph isomorphism: bypassing distribution testing lower bounds Collusion resistant traitor tracing from learning with errors Explicit binary tree codes with polylogarithmic size alphabet
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1