SWOOP: top-k similarity joins over set streams.

Impact factor: 2.8 | Zone 2 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE) | Vldb Journal | Pub Date: 2025-01-01 | Epub Date: 2024-12-23 | DOI: 10.1007/s00778-024-00880-x
Willi Mann, Nikolaus Augsten, Christian S Jensen, Mateusz Pawlik
Vldb Journal, volume 34, issue 1, pages: 13. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11666680/pdf/
Citation count: 0

Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
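The point that keeping only the k most similar pairs is not enough can be illustrated with a small brute-force baseline. This is a minimal sketch and not SWOOP itself: it stores every pair in the window rather than a minimal stock, and it uses no index or filters. The Jaccard similarity, the count-based window, and all names (`NaiveTopKWindowJoin`, `add`, `top_k`) are assumptions made for illustration; the abstract does not specify them.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


class NaiveTopKWindowJoin:
    """Hypothetical brute-force baseline: keeps ALL pairs in the window."""

    def __init__(self, k, window_size):
        self.k = k
        self.window_size = window_size  # count-based window (assumption)
        self.window = deque()           # (set_id, frozenset) in arrival order
        self.pairs = {}                 # (older_id, newer_id) -> similarity
        self.next_id = 0

    def add(self, tokens):
        new_id, new_set = self.next_id, frozenset(tokens)
        self.next_id += 1
        # When the window is full, the oldest set leaves and takes
        # all of its pairings with it.
        if len(self.window) == self.window_size:
            old_id, _ = self.window.popleft()
            self.pairs = {p: s for p, s in self.pairs.items() if old_id not in p}
        # The arriving set may pair with any set already in the window.
        for other_id, other_set in self.window:
            self.pairs[(other_id, new_id)] = jaccard(other_set, new_set)
        self.window.append((new_id, new_set))
        return new_id

    def top_k(self):
        # Highest similarity first; ties broken by pair ids.
        ranked = sorted(self.pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]


join = NaiveTopKWindowJoin(k=1, window_size=3)
join.add({"a", "b", "c"})        # id 0
join.add({"a", "b", "c", "d"})   # id 1
join.add({"a", "b", "x"})        # id 2
before = join.top_k()            # pair (0, 1) is the top-1 result
join.add({"q", "r"})             # id 3 arrives, id 0 leaves the window
after = join.top_k()             # pair (1, 2), previously not top-1, takes over
```

In this run, pair (1, 2) was outside the top-1 result until set 0 left the window. A method that had stored only the single best pair would have nothing to replace it with. The baseline avoids that by keeping a quadratic number of pairs, which is the cost a minimal stock of pairs is meant to avoid.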

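Source journal: Vldb Journal
Category: Engineering & Technology - Computer Science: Information Systems
CiteScore: 12.30
Self-citation rate: 4.80%
Articles published: 55
Review time: >12 weeks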
Journal description: The journal is dedicated to the publication of scholarly contributions in areas of data management such as database system technology and information systems, including their architectures and applications. Further, the journal's scope is restricted to areas of data management that are covered by the combined expertise of the journal's editorial board. Submissions with a substantial theory component are welcome, but the VLDB Journal expects such submissions also to embody a systems component. In relation to data mining, the journal will handle submissions where systems issues play a significant role. Factors that we use to determine whether a data mining paper is within scope include:
The submission targets systems issues in relation to data mining, e.g., by covering integration with a database engine or with other data management functionality.
The submission's contributions build on (rather than simply cite) work already published in database outlets, e.g., VLDBJ, ACM TODS, PVLDB, ACM SIGMOD, IEEE ICDE, EDBT.
The journal's editorial board has the necessary expertise on the submission's topic.
Traditional, stand-alone data mining papers that lack the above or similar characteristics are out of scope for this journal. Criteria similar to the above are applied to submissions from other areas, e.g., information retrieval and geographical information systems.
Latest articles from this journal
SWOOP: top-k similarity joins over set streams.
Optimizing RPQs over a compact graph representation
Cardinality estimation using normalizing flow
MinJoin++: a fast algorithm for string similarity joins under edit distance
Tabular data synthesis with generative adversarial networks: design space and optimizations