Distributed hash sketches: Scalable, efficient, and accurate cardinality estimation for distributed multisets

ACM Transactions on Computer Systems (IF 1.8; Zone 4, Computer Science; JCR Q2, Computer Science, Theory & Methods) Pub Date: 2009-02-01 DOI: 10.1145/1482619.1482621

Nikos Ntarmos, P. Triantafillou, G. Weikum

Citations: 24

Abstract

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
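The abstract builds on hash sketches for probabilistic counting. As a rough illustration only, the following is a minimal single-process sketch of a Flajolet-Martin style hash sketch (the PCSA variant), assuming that is the kind of counter meant. It is not the DHS method itself: there is no DHT and no placement of bits on network nodes. The class name `HashSketch`, the helpers `rho` and `item_hash`, and the parameters `num_bitmaps` and `width` are hypothetical names chosen for this example. It shows why overlapping subsets held by different nodes do not inflate the estimate: sketches merge by bitwise OR, so duplicates leave the bits unchanged.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet-Martin probabilistic counting


def rho(x: int, width: int) -> int:
    """Index of the least-significant 1 bit of x, or width if x is 0."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def item_hash(item: str) -> int:
    """Map an item to a 64-bit pseudo-uniform integer (SHA-1 is an assumption)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashSketch:
    """PCSA-style hash sketch: num_bitmaps bitmaps of `width` bits each."""

    def __init__(self, num_bitmaps: int = 64, width: int = 32):
        self.m = num_bitmaps
        self.width = width
        self.bitmaps = [0] * num_bitmaps

    def add(self, item: str) -> None:
        h = item_hash(item)
        j = h % self.m                                   # which bitmap
        r = min(rho(h // self.m, self.width), self.width - 1)  # which bit
        self.bitmaps[j] |= 1 << r                        # idempotent for duplicates

    def merge(self, other: "HashSketch") -> "HashSketch":
        """Sketch of the union of the two underlying multisets (bitwise OR)."""
        out = HashSketch(self.m, self.width)
        out.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return out

    def estimate(self) -> float:
        """Estimate the number of distinct items added so far."""
        total = 0
        for b in self.bitmaps:
            r = 0
            while b & (1 << r):  # position of the lowest zero bit
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


# Two "nodes" holding overlapping subsets of file names.
node_a, node_b = HashSketch(), HashSketch()
for i in range(0, 6000):
    node_a.add(f"file-{i}")
for i in range(4000, 10000):
    node_b.add(f"file-{i}")
network_wide = node_a.merge(node_b)
print(round(network_wide.estimate()))  # an estimate of the 10000 distinct names
```

With 64 bitmaps the expected relative error of this estimator is on the order of ten percent; using more bitmaps trades space for accuracy. In a distributed setting the individual bits would live on different nodes, which is the part this sketch deliberately leaves out.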

Source journal

ACM Transactions on Computer Systems (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 4.00

Self-citation rate: 0.00%

Review time: 1 month

About the journal: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.