Strategies for using additional resources in parallel hash-based join algorithms

Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004. Pub Date: 2004-06-04. DOI: 10.1109/HPDC.2004.34

Xi Zhang, T. Kurç, T. Pan, Ümit V. Çatalyürek, S. Narayanan, P. Wyckoff, J. Saltz


Citation count: 13

Abstract

Hash-based join is a compute- and memory-intensive algorithm. It achieves good performance and scales well to large datasets if sufficient memory is available to hold the hash table and the computing load is balanced across nodes. We compare three adaptive algorithms that start with a partitioning of the hash table across a group of nodes and expand to additional resources during the hash table building phase when memory on a node is used up. The split-based algorithm partitions the hash table range assigned to the node whose memory is full into two segments and assigns one of the segments to a new node in the system. The replication-based algorithm replicates the hash table range on a new node. The hybrid algorithm combines the first and second strategies in order to address each strategy's shortcomings. We perform an experimental performance evaluation of these algorithms on a PC cluster. Our results show that, in most cases, the hybrid algorithm either is the best of the three or performs close to the better of the other two.
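The abstract gives no implementation details, so the following is only a minimal single-process sketch of how the split-based and replication-based expansions might behave during the build phase. The names `Node`, `Cluster`, and `capacity`, the midpoint split, and the choice to route new inserts to the newest replica are all assumptions made for illustration, not the authors' design. The hybrid algorithm is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: int        # inclusive start of the bucket range held by this node
    hi: int        # exclusive end of the bucket range
    capacity: int  # hypothetical memory limit, counted in build tuples
    table: dict = field(default_factory=dict)  # bucket -> [(key, payload)]
    size: int = 0

    def full(self):
        return self.size >= self.capacity

    def add(self, bucket, key, payload):
        self.table.setdefault(bucket, []).append((key, payload))
        self.size += 1


class Cluster:
    """Toy model: every group is a list of nodes sharing one bucket range."""

    def __init__(self, n_buckets, n_nodes, capacity, strategy):
        assert strategy in ("split", "replicate")
        self.n_buckets, self.capacity, self.strategy = n_buckets, capacity, strategy
        step = n_buckets // n_nodes
        self.groups = [
            [Node(i * step, n_buckets if i == n_nodes - 1 else (i + 1) * step, capacity)]
            for i in range(n_nodes)
        ]

    def _group_for(self, bucket):
        for group in self.groups:
            if group[0].lo <= bucket < group[0].hi:
                return group
        raise KeyError(bucket)

    def _split(self, group):
        # Split-based expansion (assumed midpoint split): the upper half of
        # the full node's range, and the tuples in it, move to a new node.
        node = group[0]
        if node.hi - node.lo < 2:
            raise RuntimeError("a single-bucket range cannot be split further")
        mid = (node.lo + node.hi) // 2
        new = Node(mid, node.hi, self.capacity)
        for b in [b for b in node.table if b >= mid]:
            entries = node.table.pop(b)
            new.table[b] = entries
            new.size += len(entries)
            node.size -= len(entries)
        node.hi = mid
        self.groups.append([new])

    def insert(self, key, payload):
        bucket = hash(key) % self.n_buckets
        while True:
            group = self._group_for(bucket)
            node = group[-1]  # assumption: newest replica receives inserts
            if not node.full():
                break
            if self.strategy == "split":
                self._split(group)
            else:
                # Replication-based expansion: a new node takes the same
                # range; tuples already stored stay where they are.
                group.append(Node(node.lo, node.hi, self.capacity))
        node.add(bucket, key, payload)

    def probe(self, key):
        bucket = hash(key) % self.n_buckets
        # With replication, every replica of the range has to be checked.
        return [p for node in self._group_for(bucket)
                for k, p in node.table.get(bucket, []) if k == key]
```

For example, `Cluster(n_buckets=16, n_nodes=2, capacity=4, strategy="split")` followed by `insert(k, ...)` for integer keys grows by adding nodes with narrower ranges, while `strategy="replicate"` keeps two ranges and adds replicas to each. In this sketch, splitting moves stored tuples but a probe visits one node, whereas replication moves nothing but a probe visits every replica of the range.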


