Stream Sampling Framework and Application for Frequency Cap Statistics

ACM Transactions on Algorithms (TALG), Pub Date: 2018-09-24, DOI: 10.1145/3234338

E. Cohen


Citations: 9

Abstract

Unaggregated data, in a streamed or distributed form, are prevalent and come from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys interleave. Analytics on such data typically utilizes statistics expressed as a sum over keys in a specified segment of a function f applied to the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T. Random samples can be very effective for quick and efficient estimation of statistics at query time. Ideally, to estimate statistics for a given function f, our sample would include a key with frequency w with probability roughly proportional to f(w). The challenge is that while such “gold-standard” samples can be easily computed after aggregating the data (computing the set of key-frequency pairs), this aggregation is costly: It requires structure of size that is proportional to the number of active keys, which can be very large. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and structure size proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our ℓ-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistics and statistical guarantees on quality that are close to gold standard for cap statistics with T = Θ(ℓ). Furthermore, our multi-objective samples provide these statistical guarantees on quality for all concave sub-linear statistics (the nonnegative span of cap functions) while incurring only a logarithmic overhead on sample size.
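The definitions in the abstract can be made concrete with a small sketch. The Python below is not the paper's single-pass framework; it illustrates only the aggregated baseline the abstract calls the "gold standard": aggregate the stream into key-frequency pairs, compute a cap statistic exactly, and draw a sample that includes each key with probability roughly proportional to f(w). All names (`cap_statistic`, `pps_sample`, `estimate`, the threshold `tau`) are hypothetical, and the use of independent (Poisson) inclusion with an inverse-probability estimator is an assumption chosen for simplicity.

```python
from collections import Counter
import random


def cap_statistic(stream, T, segment=None):
    """Exact cap statistic: sum over keys in the segment of min(frequency, T).

    Aggregation needs state proportional to the number of active keys,
    which is exactly the cost the abstract says the framework avoids.
    """
    freq = Counter(stream)
    return sum(min(w, T) for key, w in freq.items()
               if segment is None or key in segment)


def pps_sample(freq, f, tau, rng):
    """Include each key independently with probability min(1, f(w) / tau).

    tau is a hypothetical threshold: a larger tau gives a smaller sample.
    """
    sample = {}
    for key, w in freq.items():
        p = min(1.0, f(w) / tau)
        if rng.random() < p:
            sample[key] = (w, p)  # keep the inclusion probability for estimation
    return sample


def estimate(sample, f, segment=None):
    """Inverse-probability estimate of the sum of f(w) over keys in the segment."""
    return sum(f(w) / p for key, (w, p) in sample.items()
               if segment is None or key in segment)


stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
distinct = cap_statistic(stream, T=1)           # Distinct: cap at 1
total = cap_statistic(stream, T=float("inf"))   # Sum: no effective cap
cap2 = cap_statistic(stream, T=2)               # frequencies capped at 2

freq = Counter(stream)
f = lambda w: min(w, 2)
sample = pps_sample(freq, f, tau=3.0, rng=random.Random(0))
approx = estimate(sample, f)  # unbiased for cap2, but varies with the random draw
```

On this toy stream the frequencies are a: 4, b: 2, c: 1, d: 1, so Distinct is 4, Sum is 8, and the cap statistic with T = 2 is 2 + 2 + 1 + 1 = 6. Setting `tau` at or below the smallest value of f(w) makes every inclusion probability 1, in which case the estimate equals the exact statistic.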



