Willi Mann, Nikolaus Augsten, Christian S Jensen, Mateusz Pawlik
SWOOP: top-k similarity joins over set streams.
VLDB Journal, 34(1), 13. Published 2025-01-01 (Epub 2024-12-23). DOI: 10.1007/s00778-024-00880-x (https://doi.org/10.1007/s00778-024-00880-x)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11666680/pdf/
Abstract
We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
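To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k similarity join over a sliding window. It is not SWOOP and uses none of its indexing or filtering techniques; it recomputes every pair in the window on each arrival. Several choices are assumptions that the abstract does not specify: Jaccard similarity as the measure, a count-based window (the window could equally be time-based), and the hypothetical names `jaccard`, `NaiveTopKWindow`, `push`, and `top_k`.

```python
from collections import deque
from heapq import nlargest
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure); 0.0 if both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKWindow:
    """Brute-force top-k similar pairs over a count-based sliding window (illustrative only)."""

    def __init__(self, window_size: int, k: int):
        self.window = deque()  # (set_id, frozenset) in arrival order
        self.window_size = window_size
        self.k = k
        self.next_id = 0

    def push(self, tokens):
        """Add a set, evict the oldest set if the window overflows, return the current top-k."""
        self.window.append((self.next_id, frozenset(tokens)))
        self.next_id += 1
        if len(self.window) > self.window_size:
            self.window.popleft()
        return self.top_k()

    def top_k(self):
        """Recompute all pairs in the window: quadratic in the window size per call."""
        pairs = (
            (jaccard(s, t), i, j)
            for (i, s), (j, t) in combinations(self.window, 2)
        )
        # Tuples sort by similarity first, then by set ids as a tie-breaker.
        return nlargest(self.k, pairs)
```

For instance, with `window_size=3` and `k=1`, pushing `{"a","b","c"}`, `{"a","b","d"}`, `{"x","y"}`, and `{"x","y","z"}` in turn keeps the pair (0, 1) with similarity 0.5 as the top result until the fourth arrival evicts set 0. The top result then becomes the pair (2, 3) with similarity 2/3. This shows why less similar pairs must be retained: a pair outside the current top-k can enter it once a better pair expires. The quadratic recomputation on every arrival is the cost that an incremental approach aims to avoid.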
About the journal:
The journal is dedicated to the publication of scholarly contributions in areas of data management such as database system technology and information systems, including their architectures and applications. Further, the journal’s scope is restricted to areas of data management that are covered by the combined expertise of the journal’s editorial board.
Submissions with a substantial theory component are welcome, but the VLDB Journal expects such submissions also to embody a systems component.
In relation to data mining, the journal will handle submissions where systems issues play a significant role. Factors that we use to determine whether a data mining paper is within scope include:
The submission targets systems issues in relation to data mining, e.g., by covering integration with a database engine or with other data management functionality.
The submission’s contributions build on (rather than simply cite) work already published in database outlets, e.g., VLDBJ, ACM TODS, PVLDB, ACM SIGMOD, IEEE ICDE, EDBT.
The journal's editorial board has the necessary expertise on the submission's topic.
Traditional, stand-alone data mining papers that lack the above or similar characteristics are out of scope for this journal. Criteria similar to the above are applied to submissions from other areas, e.g., information retrieval and geographical information systems.