Timely Reporting of Heavy Hitters Using External Memory

Shikha Singh, P. Pandey, M. A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, C. Phillips
ACM Transactions on Database Systems (TODS), pages 1-35. Published 2021-11-15. DOI: 10.1145/3472392 (https://doi.org/10.1145/3472392)

Abstract

Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy hitters has been studied extensively in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T-th occurrence, where T = ɸN (the point at which it becomes a heavy hitter). We call this the Timely Event Detection (TED) problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large, high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures scale to 11M observations per second before becoming CPU-bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
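To make the reporting condition concrete, the following is a minimal in-RAM sketch of timely reporting with exact counts: an item is emitted at the stream position of its T-th occurrence. It is not the external-memory data structure the abstract describes; it is the simple baseline whose memory grows with the number of distinct items, which is the limitation that motivates moving to external memory. The function name `timely_reports`, the parameter `threshold` (standing in for T = ɸN), and the sample stream are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Iterable, Iterator, Tuple


def timely_reports(
    stream: Iterable[Hashable], threshold: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) when an item reaches `threshold` occurrences.

    Hypothetical in-RAM baseline: exact counts, zero reporting delay,
    no false positives or negatives, one counter per distinct item.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    counts = Counter()
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        # Report exactly once, at the T-th occurrence.
        if counts[item] == threshold:
            yield position, item


# Illustrative stream; with threshold T = 3, "a" and "b" are reported.
stream = ["a", "b", "a", "c", "a", "b", "b", "a"]
reports = list(timely_reports(stream, threshold=3))
print(reports)
```

Here "a" is reported at position 5 and "b" at position 7, the positions of their third occurrences; "c" never reaches the threshold and is never reported.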