On Improving Data Skew Resilience In Main-memory Hash Joins

Proceedings of the 22nd International Database Engineering & Applications Symposium. Pub Date: 2018-06-18. DOI: 10.1145/3216122.3216156

Puya Memarzia, S. Ray, V. Bhavsar


Citations: 0

Abstract



Main memory hash joins are an important category of in-memory joins. However, the performance of these joins can be hindered by dataset skew, shuffling, and load balancing. We conducted a comprehensive study on the effects of dataset skew on four hash join algorithms. We show that hash joins are acutely affected by dataset skew, and the performance gets worse with shuffled data. To address these issues, we propose non-partitioning hash joins using two different hash tables. First, we use a separate chaining hash table that is based on an existing implementation that we have modified. This version outperforms the original implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table for hash joins, called Maple hash table. We demonstrate that this hash table is better suited to skewed and/or shuffled datasets. Moreover, this approach further improves performance by up to 17.3×.
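To make the setting concrete, here is a minimal sketch of a non-partitioning hash join over a separate chaining hash table, and of how a skewed build side lengthens a single chain. It is an illustration only: it is not the authors' modified implementation and not the Maple hash table, whose design the abstract does not describe. The class and function names, the integer keys, the modulo bucket function, and the tiny datasets are all assumptions made for the example.

```python
class ChainedHashTable:
    """Toy separate chaining table (hypothetical, for illustration only)."""

    def __init__(self, num_buckets):
        # One chain (a Python list) per bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Assumption: integer keys and a plain modulo bucket function.
        return key % len(self.buckets)

    def insert(self, key, payload):
        self.buckets[self._index(key)].append((key, payload))

    def probe(self, key):
        # A probe walks the whole chain of its bucket, so its cost
        # grows with the chain length.
        return [p for k, p in self.buckets[self._index(key)] if k == key]

    def longest_chain(self):
        return max(len(chain) for chain in self.buckets)


def hash_join(build_rows, probe_rows, num_buckets=8):
    """Non-partitioning join: build one shared table, then probe it."""
    table = ChainedHashTable(num_buckets)
    for key, payload in build_rows:          # build phase
        table.insert(key, payload)
    result = []
    for key, payload in probe_rows:          # probe phase
        for build_payload in table.probe(key):
            result.append((key, build_payload, payload))
    return result, table


# Made-up inputs: eight distinct keys versus seven copies of key 0.
uniform = [(k, f"r{k}") for k in range(8)]
skewed = [(0, f"r{i}") for i in range(7)] + [(1, "r7")]
probe = [(0, "s0"), (1, "s1")]

uniform_result, uniform_table = hash_join(uniform, probe)
skewed_result, skewed_table = hash_join(skewed, probe)

print(uniform_table.longest_chain(), len(uniform_result))
print(skewed_table.longest_chain(), len(skewed_result))
```

In the skewed input, every duplicate of key 0 lands in the same chain, so each probe for that key scans seven entries instead of one. This is the kind of imbalance the abstract refers to, shown here at toy scale and without any of the paper's optimizations.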
