字典压缩的根源:字符串吸引子

Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing Pub Date : 2017-10-30 DOI:10.1145/3188745.3188814

Dominik Kempa, N. Prezza

{"title":"字典压缩的根源:字符串吸引子","authors":"Dominik Kempa, N. Prezza","doi":"10.1145/3188745.3188814","DOIUrl":null,"url":null,"abstract":"A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text’s size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text’s substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem — deciding whether a text has a size-t set of positions capturing all substrings of length at most k — is NP-complete for k≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.","PeriodicalId":20593,"journal":{"name":"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2017-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"95","resultStr":"{\"title\":\"At the roots of dictionary compression: string attractors\",\"authors\":\"Dominik Kempa, N. Prezza\",\"doi\":\"10.1145/3188745.3188814\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text’s size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text’s substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem — deciding whether a text has a size-t set of positions capturing all substrings of length at most k — is NP-complete for k≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.\",\"PeriodicalId\":20593,\"journal\":{\"name\":\"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-10-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"95\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3188745.3188814\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3188745.3188814","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 95

摘要

在无损文本压缩领域中一个众所周知的事实是，当输入包含长重复时，高阶熵是一个弱模型。在这一事实的推动下，几十年的研究已经产生了无数所谓的字典压缩器:能够通过利用文本的重复来减少文本大小的算法。Lempel-Ziv 77是这类工具中最成功和最著名的工具之一，紧随其后的是直线规划、运行长度Burrows-Wheeler变换、宏观方案、拼贴系统和紧致有向无环字图。在本文中，我们展示了这些技术是同一个优雅的组合问题的不同解决方案:找到捕获所有不同文本子字符串的一小组位置。我们称这样的集合为弦吸引子。我们首先展示了字典压缩器和字符串吸引器之间的约简。这给出了字典压缩器相对于最小字符串吸引子的近似比率，并允许我们揭示不同字典压缩器输出大小之间的新渐近关系。然后我们证明了k吸引子问题——决定一个文本是否有一个大小为t的位置集合，捕获了长度不超过k的所有子串——对于k≥3是np完全的。特别地，这包括了全弦吸引子问题。我们提供了几种最小k吸引子的逼近方法，证明了对于常数k问题是apx完备的，并给出了强的不可逼近性结果。最后，我们给出了字符串吸引子随机存取问题的匹配下界和上界。通过展示在最优时间内支持查询的数据结构来证明上界。我们的数据结构是通用的:通过对字符串吸引子的简化，它支持对任何字典压缩方案的随机访问。特别是，它还匹配LZ77、直线程序、拼贴系统和宏方案的下界，因此基本上(立即)解决了所有这些压缩器的随机访问问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

At the roots of dictionary compression: string attractors

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this fact, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text’s size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all distinct text’s substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and allows us to uncover new asymptotic relations between the output sizes of different dictionary compressors. We then show that the k-attractor problem — deciding whether a text has a size-t set of positions capturing all substrings of length at most k — is NP-complete for k≥ 3. This, in particular, includes the full string attractor problem. We provide several approximation techniques for the smallest k-attractor, show that the problem is APX-complete for constant k, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore essentially closes (at once) the random access problem for all these compressors.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing

自引率

0.00%

发文量

期刊最新文献

Data-dependent hashing via nonlinear spectral gaps Interactive compression to external information The query complexity of graph isomorphism: bypassing distribution testing lower bounds Collusion resistant traitor tracing from learning with errors Explicit binary tree codes with polylogarithmic size alphabet