On Improving Data Skew Resilience In Main-memory Hash Joins

Puya Memarzia, S. Ray, V. Bhavsar
Proceedings of the 22nd International Database Engineering & Applications Symposium, published 2018-06-18
DOI: 10.1145/3216122.3216156 (https://doi.org/10.1145/3216122.3216156)
Citations: 0

Abstract

Main memory hash joins are an important category of in-memory joins. However, the performance of these joins can be hindered by dataset skew, shuffling, and load balancing. We conducted a comprehensive study on the effects of dataset skew on four hash join algorithms. We show that hash joins are acutely affected by dataset skew, and the performance gets worse with shuffled data. To address these issues, we propose non-partitioning hash joins using two different hash tables. First, we use a separate chaining hash table that is based on an existing implementation that we have modified. This version outperforms the original implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table for hash joins, called Maple hash table. We demonstrate that this hash table is better suited to skewed and/or shuffled datasets. Moreover, this approach further improves performance by up to 17.3×.
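
To make the terms in the abstract concrete, the following is a minimal sketch of a non-partitioning hash join over a separate chaining hash table. It is not the authors' modified implementation and it is not the Maple hash table, neither of which is described on this page. The names `build`, `probe`, and `Row`, the bucket count, and the sample rows are all assumptions chosen for illustration. The sketch also shows why skew matters: rows that share one join key all land in the same chain, so every probe for that key walks the whole chain.

```python
from typing import Iterable, List, Tuple

# Hypothetical row layout for this sketch: (join key, payload).
Row = Tuple[int, str]


def build(rows: Iterable[Row], num_buckets: int) -> List[List[Row]]:
    """Build phase: insert every build-side row into a chained hash table.

    Each bucket is a Python list standing in for a collision chain.
    """
    buckets: List[List[Row]] = [[] for _ in range(num_buckets)]
    for key, payload in rows:
        buckets[hash(key) % num_buckets].append((key, payload))
    return buckets


def probe(buckets: List[List[Row]], rows: Iterable[Row]) -> List[Tuple[int, str, str]]:
    """Probe phase: for each probe-side row, scan the matching chain.

    Emits (key, build payload, probe payload) for every key match.
    """
    num_buckets = len(buckets)
    matches: List[Tuple[int, str, str]] = []
    for key, payload in rows:
        for build_key, build_payload in buckets[hash(key) % num_buckets]:
            if build_key == key:
                matches.append((key, build_payload, payload))
    return matches


# Illustrative data only: key 7 is duplicated on the build side (a tiny "skew").
build_rows: List[Row] = [(1, "a"), (2, "b"), (7, "c"), (7, "d")]
probe_rows: List[Row] = [(7, "x"), (3, "y"), (1, "z")]

table = build(build_rows, num_buckets=4)
joined = probe(table, probe_rows)
longest_chain = max(len(chain) for chain in table)

print(joined)
print("longest chain:", longest_chain)
```

Neither relation is partitioned before the join: the build side goes into a single shared table and the probe side is scanned once against it. With heavily duplicated keys the longest chain grows with the number of duplicates, which is one plausible way skew can degrade a chained table; the paper's actual measurements and remedies are in the original text.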