WavingSketch: an unbiased and generic sketch for finding top-k items in data streams

Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang
{"title":"WavingSketch: an unbiased and generic sketch for finding top-k items in data streams","authors":"Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang","doi":"10.1007/s00778-024-00869-6","DOIUrl":null,"url":null,"abstract":"<p>Finding top-<i>k</i> items in data streams is a fundamental problem in data mining. Unbiased estimation is well acknowledged as an elegant and important property for top-<i>k</i> algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketchcan provide unbiased estimation, and derive its error bound. WavingSketchis generic to measurement tasks, and we apply it to five applications: finding top-<i>k</i> frequent items, finding top-<i>k</i> heavy changes, finding top-<i>k</i> persistent items, finding top-<i>k</i> Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketchachieves <span>\\(10 \\times \\)</span> faster speed and <span>\\(10^3 \\times \\)</span> smaller error on finding frequent items. For other applications, WavingSketchalso achieves higher accuracy and faster speed. All related codes are open-sourced at GitHub (https://github.com/WavingSketch/Waving-Sketch).</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The VLDB Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00778-024-00869-6","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is well acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketchcan provide unbiased estimation, and derive its error bound. WavingSketchis generic to measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketchachieves \(10 \times \) faster speed and \(10^3 \times \) smaller error on finding frequent items. For other applications, WavingSketchalso achieves higher accuracy and faster speed. All related codes are open-sourced at GitHub (https://github.com/WavingSketch/Waving-Sketch).

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
WavingSketch:用于在数据流中查找前 k 项的无偏通用草图
查找数据流中的顶 k 项是数据挖掘中的一个基本问题。无偏估计是 top-k 算法公认的优雅而重要的特性。在本文中,我们提出了一种名为 WavingSketch 的新型草图算法,它比现有的无偏算法更加精确。我们从理论上证明了 WavingSketch 可以提供无偏估计,并推导出其误差边界。WavingSketch 是测量任务的通用算法,我们将其应用于五个应用领域:查找前 k 个频繁项、查找前 k 个重大变化项、查找前 k 个持久项、查找前 k 个超级传播者以及连接-聚合估计。我们的实验结果表明,与最先进的无偏空间节省法相比,WavingSketch在寻找频繁项方面的速度更快,误差更小。在其他应用中,WavingSketch 也实现了更高的准确率和更快的速度。所有相关代码都在 GitHub 上开源(https://github.com/WavingSketch/Waving-Sketch)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A versatile framework for attributed network clustering via K-nearest neighbor augmentation Discovering critical vertices for reinforcement of large-scale bipartite networks DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search Enabling space-time efficient range queries with REncoder AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1