PIEJoin: Towards Parallel Set Containment Joins

Proceedings of the 28th International Conference on Scientific and Statistical Database Management. Pub Date: 2016-07-18. DOI: 10.1145/2949689.2949694

Anja Kunkel, Astrid Rheinländer, C. Schiefer, S. Helmer, Panagiotis Bouros, U. Leser


Citations: 11

Abstract

The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, a number of open questions remain: an extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and data properties on algorithm performance is still largely unknown. Furthermore, all previous work considered only sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin that uses tree signatures to save space and interval labeling to improve the runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin, already in its sequential form, clearly outperforms two of its three most important competitors (PRETTI and PRETTI+). It is mostly, though not always, slower than the third, LIMIT+(opj), but requires significantly less space. The parallel version of PIEJoin presented here achieves significant further speed-ups, yet our evaluation also shows that further research is needed, as finding the best way of partitioning the join turns out to be non-trivial.
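To make the idea of joining by intersecting prefix trees concrete, the following is a minimal sketch and not the authors' implementation. It assumes the join returns all pairs (r, s) with r a subset of s, that items are sorted in ascending order before insertion, and it leaves out tree signatures, interval labeling, and parallelism. All names (`Node`, `build_trie`, `find`, `containment_join`, `naive_join`) are hypothetical and chosen only for illustration.

```python
from itertools import product


class Node:
    """Prefix-tree node; a set ends here if its id is listed in `ids`."""

    def __init__(self):
        self.children = {}  # item -> Node
        self.ids = []       # ids of sets whose sorted item list ends at this node


def build_trie(sets):
    """Build a prefix tree from a mapping {set_id: iterable of items}."""
    root = Node()
    for set_id, items in sets.items():
        node = root
        for item in sorted(set(items)):
            node = node.children.setdefault(item, Node())
        node.ids.append(set_id)
    return root


def subtree_ids(node):
    """Yield the ids of all sets stored at or below `node`."""
    yield from node.ids
    for child in node.children.values():
        yield from subtree_ids(child)


def find(node, item):
    """Yield nodes labelled `item` below `node`.

    Paths are ascending, so branches whose label exceeds `item` are pruned.
    """
    for label, child in node.children.items():
        if label == item:
            yield child
        elif label < item:
            yield from find(child, item)


def containment_join(r_node, s_node):
    """Yield (r_id, s_id) pairs with r a subset of s by walking both trees."""
    # Every set at or below s_node contains all items on the path to r_node.
    if r_node.ids:
        yield from product(r_node.ids, subtree_ids(s_node))
    # Match the next item of each R branch somewhere below s_node.
    for item, r_child in r_node.children.items():
        for s_match in find(s_node, item):
            yield from containment_join(r_child, s_match)


def naive_join(r_sets, s_sets):
    """Quadratic reference join, used only to cross-check the sketch."""
    return {(r, s) for r, a in r_sets.items()
            for s, b in s_sets.items() if set(a) <= set(b)}


# Small made-up example relations (not from the paper's data sets).
R = {"r1": {1, 2}, "r2": {3}, "r3": set()}
S = {"s1": {1, 2, 3}, "s2": {2, 3}, "s3": {1, 2, 4}}
pairs = list(containment_join(build_trie(R), build_trie(S)))
```

In this sketch each S tuple is reached through exactly one matching path, so no pair is reported twice. One plausible reading of "partitioning the tree intersection" is that the recursive calls for different child branches are independent and could be handed to separate workers, but how PIEJoin actually partitions the work is not specified in the abstract.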


