Anja Kunkel, Astrid Rheinländer, C. Schiefer, S. Helmer, Panagiotis Bouros, U. Leser
Title: PIEJoin: Towards Parallel Set Containment Joins
Published in: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
Publication date: 2016-07-18
DOI: 10.1145/2949689.2949694 (https://doi.org/10.1145/2949689.2949694)
Citations: 11
Abstract
The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, a number of open questions remain: an extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and of data properties on algorithm performance is still largely unknown. Furthermore, all previous works considered only sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin which uses tree signatures to save space and interval labeling to improve the runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin, already in its sequential form, clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly, yet not always, slower than the third, LIMIT+(opj), but requires significantly less space. The parallel version of PIEJoin presented here achieves significant further speed-ups, yet our evaluation also shows that further research is needed, as finding the best way of partitioning the join turns out to be non-trivial.
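To make the core idea concrete, the following is a minimal sequential sketch of a set containment join computed by walking two prefix trees together. It is not the authors' implementation: the names `Node`, `build_trie`, `subtree_ids` and `containment_join` are hypothetical, items are assumed to be mutually comparable and sorted in their natural order, and the tree signatures, interval labeling and parallel partitioning mentioned in the abstract are all omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Prefix-tree node (hypothetical layout, not taken from the paper)."""
    children: dict = field(default_factory=dict)  # item -> child Node
    ids: list = field(default_factory=list)       # ids of sets ending at this node


def build_trie(sets):
    """Insert every set as a path of its distinct items in sorted order."""
    root = Node()
    for set_id, items in enumerate(sets):
        node = root
        for item in sorted(set(items)):
            node = node.children.setdefault(item, Node())
        node.ids.append(set_id)
    return root


def subtree_ids(node):
    """Collect the ids of all sets stored at or below node."""
    out = list(node.ids)
    for child in node.children.values():
        out.extend(subtree_ids(child))
    return out


def containment_join(r_sets, s_sets):
    """Return all pairs (i, j) with r_sets[i] being a subset of s_sets[j]."""
    r_root, s_root = build_trie(r_sets), build_trie(s_sets)
    result = []

    def match(r_node, s_node):
        # Invariant: the items on the path to r_node all occur on the path to s_node,
        # so every set stored at or below s_node contains the sets ending at r_node.
        if r_node.ids:
            supersets = subtree_ids(s_node)
            result.extend((r, s) for r in r_node.ids for s in supersets)
        for item, r_child in r_node.children.items():
            find(item, r_child, s_node)

    def find(item, r_child, s_node):
        # Look for nodes labelled `item` below s_node, stepping over smaller items.
        for s_item, s_child in s_node.children.items():
            if s_item == item:
                match(r_child, s_child)
            elif s_item < item:
                find(item, r_child, s_child)
            # s_item > item: paths are sorted, so `item` cannot occur further down.

    match(r_root, s_root)
    return sorted(result)


R = [{1, 2}, {3}, {2, 4}]
S = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}]
print(containment_join(R, S))
# prints [(0, 0), (0, 2), (1, 0), (1, 1), (2, 1), (2, 2)]
```

In this sketch each recursive call works on an independent pair of subtrees. That is one plausible reading of why the tree intersection lends itself to partitioning, although how best to split the work is exactly the question the abstract leaves open.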