Locality-sensitive operators for parallel main-memory database clusters

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI:10.1109/ICDE.2014.6816684

Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann

{"title":"Locality-sensitive operators for parallel main-memory database clusters","authors":"Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2014.6816684","DOIUrl":null,"url":null,"abstract":"The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"61","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 30th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2014.6816684","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 61

Abstract

The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

用于并行主存数据库集群的位置敏感操作符

在过去的几十年里，计算速度的增长已经超过了网络带宽的增长。这导致本地处理和分布式处理之间的性能差距越来越大。因此，并行数据库集群必须最大化查询处理的局部性。实现这一目的的常用技术是共分区关系，以避免在网络上进行昂贵的数据变换。但是，这仅限于每个关系的一个属性，并且在面对更新时维护成本很高。由于与分布键的相关性，其他属性通常表现出模糊的共定位，但目前的方法没有利用这一点。在本文中，我们引入了位置敏感的数据变换，它可以大大减少分布式操作(如连接和聚合)的网络通信量。我们提出了四种新技术:(i)最优分区分配利用局部性来减少网络相位持续时间;(ii)通信调度避免了由于交叉流量导致的带宽利用率不足;(iii)自适应基数分区在数据重分区期间保持局域性，并优雅地处理值倾斜;(iv)选择性广播减少了存在极端值偏差或大量重复的网络通信。我们提供了全面的实验结果，表明我们的技术可以将模糊共定位的性能提高5倍，对于具有值偏差的输入可以提高3倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2014 IEEE 30th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Managing uncertainty in spatial and spatio-temporal data Locality-sensitive operators for parallel main-memory database clusters KnowLife: A knowledge graph for health and life sciences We can learn your #hashtags: Connecting tweets to explicit topics A demonstration of MNTG - A web-based road network traffic generator