Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs

Sepehr Assadi, Nirmit Joshi, M. Prabhu, Vihan Shah
{"title":"Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs","authors":"Sepehr Assadi, Nirmit Joshi, M. Prabhu, Vihan Shah","doi":"10.48550/arXiv.2303.06288","DOIUrl":null,"url":null,"abstract":"Estimating quantiles, like the median or percentiles, is a fundamental task in data mining and data science. A (streaming) quantile summary is a data structure that can process a set S of n elements in a streaming fashion and at the end, for any phi in (0,1], return a phi-quantile of S up to an eps error, i.e., return a phi'-quantile with phi'=phi +- eps. We are particularly interested in comparison-based summaries that only compare elements of the universe under a total ordering and are otherwise completely oblivious of the universe. The best known deterministic quantile summary is the 20-year old Greenwald-Khanna (GK) summary that uses O((1/eps) log(eps n)) space [SIGMOD'01]. This bound was recently proved to be optimal for all deterministic comparison-based summaries by Cormode and Vesle\\'y [PODS'20]. In this paper, we study weighted quantiles, a generalization of the quantiles problem, where each element arrives with a positive integer weight which denotes the number of copies of that element being inserted. The only known method of handling weighted inputs via GK summaries is the naive approach of breaking each weighted element into multiple unweighted items and feeding them one by one to the summary, which results in a prohibitively large update time (proportional to the maximum weight of input elements). We give the first non-trivial extension of GK summaries for weighted inputs and show that it takes O((1/eps) log(eps n)) space and O(log(1/eps)+ log log(eps n)) update time per element to process a stream of length n (under some quite mild assumptions on the range of weights and eps). En route to this, we also simplify the original GK summaries for unweighted quantiles.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2303.06288","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Estimating quantiles, like the median or percentiles, is a fundamental task in data mining and data science. A (streaming) quantile summary is a data structure that can process a set S of n elements in a streaming fashion and at the end, for any phi in (0,1], return a phi-quantile of S up to an eps error, i.e., return a phi'-quantile with phi'=phi +- eps. We are particularly interested in comparison-based summaries that only compare elements of the universe under a total ordering and are otherwise completely oblivious of the universe. The best known deterministic quantile summary is the 20-year old Greenwald-Khanna (GK) summary that uses O((1/eps) log(eps n)) space [SIGMOD'01]. This bound was recently proved to be optimal for all deterministic comparison-based summaries by Cormode and Vesle\'y [PODS'20]. In this paper, we study weighted quantiles, a generalization of the quantiles problem, where each element arrives with a positive integer weight which denotes the number of copies of that element being inserted. The only known method of handling weighted inputs via GK summaries is the naive approach of breaking each weighted element into multiple unweighted items and feeding them one by one to the summary, which results in a prohibitively large update time (proportional to the maximum weight of input elements). We give the first non-trivial extension of GK summaries for weighted inputs and show that it takes O((1/eps) log(eps n)) space and O(log(1/eps)+ log log(eps n)) update time per element to process a stream of length n (under some quite mild assumptions on the range of weights and eps). En route to this, we also simplify the original GK summaries for unweighted quantiles.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
加权输入的广义Greenwald-Khanna流分位数摘要
估计分位数,如中位数或百分位数,是数据挖掘和数据科学中的一项基本任务。(流式)分位数汇总是一种数据结构,它可以以流式方式处理n个元素的集合S,最后,对于(0,1]中的任何phi,返回S的phi分位数,误差为eps,即返回phi'=phi +- eps的phi'-分位数。我们对基于比较的总结特别感兴趣,这种总结只比较宇宙中总顺序下的元素,否则就完全忽略了宇宙。最著名的确定性分位数总结是已有20年历史的Greenwald-Khanna (GK)总结,它使用O((1/eps) log(eps n))空间[SIGMOD'01]。最近,Cormode和Vesle\'y [PODS'20]证明了该界对于所有基于确定性比较的总结都是最优的。在本文中,我们研究了加权分位数问题,这是分位数问题的一种推广,其中每个元素的权值为正整数,表示该元素被插入的拷贝数。通过GK摘要处理加权输入的唯一已知方法是将每个加权元素分解为多个未加权项并将它们一个接一个地提供给摘要的简单方法,这会导致非常大的更新时间(与输入元素的最大权重成正比)。我们给出了加权输入的GK摘要的第一个非平凡扩展,并表明它需要O((1/eps) log(eps n))空间和O(log(1/eps)+ log log(eps n))每个元素的更新时间来处理长度为n的流(在权重和eps范围的一些相当温和的假设下)。在此过程中,我们还简化了原始GK摘要的未加权分位数。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs A Simple Algorithm for Consistent Query Answering under Primary Keys Size Bounds and Algorithms for Conjunctive Regular Path Queries Compact Data Structures Meet Databases (Invited Talk) Enumerating Subgraphs of Constant Sizes in External Memory
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1