Effective Construction of Relative Lempel-Ziv Dictionaries

Kewen Liao, M. Petri, Alistair Moffat, Anthony Wirth
{"title":"Effective Construction of Relative Lempel-Ziv Dictionaries","authors":"Kewen Liao, M. Petri, Alistair Moffat, Anthony Wirth","doi":"10.1145/2872427.2883042","DOIUrl":null,"url":null,"abstract":"Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each \"epoch\" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on World Wide Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872427.2883042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 23

Abstract

Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel-Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive with adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.
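As background for the abstract, the RLZ scheme represents each document as a sequence of references into a shared, fixed dictionary. The following is a minimal Python sketch of greedy RLZ factorisation and decoding against an in-memory dictionary string; the function names, the naive longest-match search, and the literal fallback are illustrative assumptions, not the authors' implementation.

```python
def rlz_encode(text: str, dictionary: str):
    """Greedy left-to-right parse of `text` into factors.

    Each factor is (offset, length) pointing into `dictionary`; a character
    with no dictionary match is emitted as a (char, 0) literal.
    """
    factors = []
    i = 0
    while i < len(text):
        # Extend the match one character at a time (naive but correct).
        length = 0
        offset = -1
        while i + length < len(text):
            pos = dictionary.find(text[i:i + length + 1])
            if pos == -1:
                break
            offset, length = pos, length + 1
        if length == 0:
            factors.append((text[i], 0))   # literal fallback
            i += 1
        else:
            factors.append((offset, length))
            i += length
    return factors


def rlz_decode(factors, dictionary: str) -> str:
    """Reassemble the text by copying factor ranges out of the dictionary."""
    out = []
    for offset, length in factors:
        if length == 0:
            out.append(offset)             # offset holds the literal character
        else:
            out.append(dictionary[offset:offset + length])
    return "".join(out)


if __name__ == "__main__":
    D = "the quick brown fox jumps over the lazy dog"
    T = "the lazy fox jumps over the quick dog"
    assert rlz_decode(rlz_encode(T, D), D) == T
```

In practice, RLZ factorisation is driven by a suffix array or similar index over the dictionary rather than the naive substring search used above; the sketch is only meant to show why a dictionary that covers the collection's common material yields short factor sequences.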
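The construction heuristic described in the abstract can likewise be sketched: estimate k-mer frequencies with a reservoir sample, then within each epoch keep the candidate segment that adds the most weight of not-yet-covered frequent k-mers, in the spirit of greedy Set Cover. The segment grid, the scoring rule, and the parameter names below are assumptions made for illustration and may differ from the paper's exact coverage score and epoch handling.

```python
import random
from collections import Counter


def sample_kmer_counts(collection: str, k: int, reservoir_size: int, seed: int = 1):
    """Reservoir-sample k-mer start positions and count the sampled k-mers."""
    rng = random.Random(seed)
    reservoir = []
    n = len(collection) - k + 1
    for i in range(n):
        if len(reservoir) < reservoir_size:
            reservoir.append(i)
        else:
            j = rng.randrange(i + 1)
            if j < reservoir_size:
                reservoir[j] = i
    return Counter(collection[i:i + k] for i in reservoir)


def build_dictionary(collection: str, k: int, epoch_len: int,
                     seg_len: int, reservoir_size: int) -> str:
    """Concatenate, per epoch, the segment with the highest coverage score."""
    counts = sample_kmer_counts(collection, k, reservoir_size)
    covered = set()
    pieces = []
    for e_start in range(0, len(collection), epoch_len):
        epoch = collection[e_start:e_start + epoch_len]
        best_score, best_seg = -1, ""
        # Candidate segments on a fixed, non-overlapping grid (an assumption).
        for s in range(0, max(1, len(epoch) - seg_len + 1), seg_len):
            seg = epoch[s:s + seg_len]
            kmers = {seg[i:i + k] for i in range(len(seg) - k + 1)}
            # Score = sampled frequency mass of k-mers not yet in the dictionary.
            score = sum(counts[m] for m in kmers - covered)
            if score > best_score:
                best_score, best_seg = score, seg
        covered.update(best_seg[i:i + k] for i in range(len(best_seg) - k + 1))
        pieces.append(best_seg)
    return "".join(pieces)
```

Scoring only the k-mers that are not already covered is what gives the selection its greedy Set Cover flavour: once a common k-mer is represented somewhere in the dictionary, later segments gain nothing from repeating it.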