Two-level massive string dictionaries

Information Systems · IF 3.0 · Zone 2 (Computer Science) · JCR Q2 (Computer Science, Information Systems) · Pub Date: 2024-11-08 · DOI: 10.1016/j.is.2024.102490

Paolo Ferragina, Mariagiovanna Rotundo, Giorgio Vinciguerra

Information Systems, volume 128, Article 102490. Journal Article. https://www.sciencedirect.com/science/article/pii/S0306437924001480

Citations: 0

Abstract

We study the problem of engineering space–time efficient data structures that support membership and rank queries on very large static dictionaries of strings.

Our solution is based on a very simple approach that decouples string storage and string indexing by means of a block-wise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. On top of this, we design an in-memory cache that, given a sample of the query workload, augments the Patricia trie with additional information to reduce the number of I/Os of future queries.

Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries, compared to Patricia tries, do not provide significant benefits when used in a large-scale indexing setting, and (ii) our two-level approach enables the indexing and storage of 3.5 billion strings taking 273 GB in just less than 200 MB of internal memory and 83 GB of compressed disk space, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future design.
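
The two-level layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it simplifies heavily: a sorted list of block heads searched with `bisect` stands in for the succinct Patricia trie, `zlib` stands in for the block-wise compressor (the abstract does not name one), and an in-memory list of byte strings stands in for external memory. The class name `TwoLevelDict`, the `block_size` parameter, the `io_count` counter, and the definition of rank as the number of dictionary strings less than or equal to the query are all assumptions made for illustration.

```python
import bisect
import zlib


class TwoLevelDict:
    """Toy two-level static string dictionary (illustrative only)."""

    def __init__(self, strings, block_size=4):
        # Assumption: strings contain no newline, which is used as a separator.
        keys = sorted(set(strings))
        self.block_size = block_size
        self.heads = []    # upper level (in memory): first string of each block
        self.blocks = []   # lower level ("disk"): one compressed blob per block
        self.io_count = 0  # number of simulated block reads
        for i in range(0, len(keys), block_size):
            chunk = keys[i:i + block_size]
            self.heads.append(chunk[0])
            self.blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))

    def _find_block(self, query):
        # Index of the last block whose head is <= query, or -1 if none.
        # A sorted array replaces the Patricia trie of the paper here.
        return bisect.bisect_right(self.heads, query) - 1

    def _load_block(self, b):
        # Simulates one I/O: fetch and decompress a single block.
        self.io_count += 1
        return zlib.decompress(self.blocks[b]).decode("utf-8").split("\n")

    def contains(self, query):
        b = self._find_block(query)
        if b < 0:
            return False
        block = self._load_block(b)
        i = bisect.bisect_left(block, query)
        return i < len(block) and block[i] == query

    def rank(self, query):
        # Assumed definition: number of dictionary strings <= query.
        b = self._find_block(query)
        if b < 0:
            return 0
        block = self._load_block(b)
        return b * self.block_size + bisect.bisect_right(block, query)


d = TwoLevelDict(
    ["banana", "apple", "cherry", "date", "elder", "fig", "grape"],
    block_size=3,
)
print(d.contains("elder"), d.contains("eagle"), d.rank("cherry"))
```

In this sketch each query touches at most one block, so the in-memory level only needs one entry per block rather than one per string. A query that sorts before every block head is answered without any simulated I/O.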




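Source journal

Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40

Self-citation rate: 2.70%

Articles published: 112

Review time: 53 days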

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models and performance enhancements, and show how those innovations contribute to the goals of the application.

Latest articles from this journal

Discovering partially ordered workflow models

Learning policies for resource allocation in business processes

STracker: A framework for identifying sentiment changes in customer feedbacks

Two-level massive string dictionaries

A generative and discriminative model for diversity-promoting recommendation