SWOOP: top-k similarity joins over set streams.

IF 2.8 | Zone 2 (Computer Science) | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vldb Journal | Pub Date: 2025-01-01 | Epub Date: 2024-12-23 | DOI: 10.1007/s00778-024-00880-x

Willi Mann, Nikolaus Augsten, Christian S Jensen, Mateusz Pawlik

Vldb Journal 34(1): 13. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11666680/pdf/

Citations: 0

Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
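The point that keeping only the k most similar pairs is not sufficient can be illustrated with a brute-force baseline. The sketch below is not SWOOP: it has no index, no filters, and no minimal stock of pairs, and it recomputes the result from the whole window on every query. Jaccard similarity, the count-based window, and all names (`NaiveTopKWindowJoin`, `insert`, `top_k`) are assumptions made for illustration; the abstract does not specify them.

```python
from collections import deque
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (an assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKWindowJoin:
    """Hypothetical brute-force baseline, not the SWOOP algorithm."""

    def __init__(self, window_size: int, k: int):
        self.window = deque()           # (set_id, frozenset) in arrival order
        self.window_size = window_size  # count-based window (an assumption)
        self.k = k
        self.next_id = 0

    def insert(self, tokens) -> None:
        """Add a new set; evict the oldest set first if the window is full."""
        if len(self.window) == self.window_size:
            self.window.popleft()
        self.window.append((self.next_id, frozenset(tokens)))
        self.next_id += 1

    def top_k(self):
        """Return up to k (similarity, id1, id2) triples, most similar first."""
        pairs = [
            (jaccard(s1, s2), i1, i2)
            for (i1, s1), (i2, s2) in combinations(self.window, 2)
        ]
        # Sort by similarity descending; break ties by ids for determinism.
        pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
        return pairs[: self.k]


join = NaiveTopKWindowJoin(window_size=3, k=1)
join.insert({"a", "b", "c"})        # set 0
join.insert({"a", "b", "c", "d"})   # set 1
join.insert({"a", "b", "e", "f"})   # set 2
before = join.top_k()               # pair (0, 1) is the top-1 result
join.insert({"q"})                  # set 3 arrives, set 0 leaves the window
after = join.top_k()                # pair (1, 2) takes over as top-1
```

In this run, the pair of sets 1 and 2 is not in the top-1 result at first, yet it becomes the top-1 pair once set 0 leaves the window. A method that had stored only the single best pair would have nothing to fall back on, which is why some stock of less similar pairs has to be maintained.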




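Source journal

Vldb Journal (Engineering & Technology - Computer Science: Information Systems)

CiteScore

12.30

Self-citation rate

4.80%

Articles published

Review time

>12 weeks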

Journal description: The journal is dedicated to the publication of scholarly contributions in areas of data management such as database system technology and information systems, including their architectures and applications. Further, the journal's scope is restricted to areas of data management that are covered by the combined expertise of the journal's editorial board. Submissions with a substantial theory component are welcome, but the VLDB Journal expects such submissions also to embody a systems component. In relation to data mining, the journal will handle submissions where systems issues play a significant role. Factors that we use to determine whether a data mining paper is within scope include: (1) the submission targets systems issues in relation to data mining, e.g., by covering integration with a database engine or with other data management functionality; (2) the submission's contributions build on (rather than simply cite) work already published in database outlets, e.g., VLDBJ, ACM TODS, PVLDB, ACM SIGMOD, IEEE ICDE, EDBT; (3) the journal's editorial board has the necessary expertise on the submission's topic. Traditional, stand-alone data mining papers that lack the above or similar characteristics are out of scope for this journal. Criteria similar to the above are applied to submissions from other areas, e.g., information retrieval and geographical information systems.

Latest articles from this journal

SWOOP: top-k similarity joins over set streams.
Optimizing RPQs over a compact graph representation
Cardinality estimation using normalizing flow
MinJoin++: a fast algorithm for string similarity joins under edit distance
Tabular data synthesis with generative adversarial networks: design space and optimizations