Xi Zhang, T. Kurç, T. Pan, Ümit V. Çatalyürek, S. Narayanan, P. Wyckoff, J. Saltz
{"title":"在基于哈希的并行连接算法中使用额外资源的策略","authors":"Xi Zhang, T. Kurç, T. Pan, Ümit V. Çatalyürek, S. Narayanan, P. Wyckoff, J. Saltz","doi":"10.1109/HPDC.2004.34","DOIUrl":null,"url":null,"abstract":"Hash-based join is a compute- and memory-intensive algorithm. It achieves good performance and scales well to large datasets, if sufficient memory is available to hold the hash table and the distribution of computing had across nodes is balanced. We compare three adaptive algorithms that start with a partitioning of the hash table across a group of nodes and expand during the hash table building phase to additional resources, when memory on a node is used up. The split-based algorithm partitions the hash table range assigned to the node, on which memory is full, into two segments and assigns one of the segments to a new node in the system. The replication-based algorithm replicates the hash table range on a new node. The hybrid algorithm combines the first and second strategies in order to address each strategy's short comings. We perform an experimental performance evaluation of these algorithms on a PC cluster. Our results show that among the three algorithms, in most cases the hybrid algorithm either performs close to the better of the two or is the best algorithm.","PeriodicalId":446429,"journal":{"name":"Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Strategies for using additional resources in parallel hash-based join algorithms\",\"authors\":\"Xi Zhang, T. Kurç, T. Pan, Ümit V. Çatalyürek, S. Narayanan, P. Wyckoff, J. Saltz\",\"doi\":\"10.1109/HPDC.2004.34\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hash-based join is a compute- and memory-intensive algorithm. It achieves good performance and scales well to large datasets, if sufficient memory is available to hold the hash table and the distribution of computing had across nodes is balanced. We compare three adaptive algorithms that start with a partitioning of the hash table across a group of nodes and expand during the hash table building phase to additional resources, when memory on a node is used up. The split-based algorithm partitions the hash table range assigned to the node, on which memory is full, into two segments and assigns one of the segments to a new node in the system. The replication-based algorithm replicates the hash table range on a new node. The hybrid algorithm combines the first and second strategies in order to address each strategy's short comings. We perform an experimental performance evaluation of these algorithms on a PC cluster. Our results show that among the three algorithms, in most cases the hybrid algorithm either performs close to the better of the two or is the best algorithm.\",\"PeriodicalId\":446429,\"journal\":{\"name\":\"Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPDC.2004.34\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPDC.2004.34","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Strategies for using additional resources in parallel hash-based join algorithms
Hash-based join is a compute- and memory-intensive algorithm. It achieves good performance and scales well to large datasets, if sufficient memory is available to hold the hash table and the distribution of computing had across nodes is balanced. We compare three adaptive algorithms that start with a partitioning of the hash table across a group of nodes and expand during the hash table building phase to additional resources, when memory on a node is used up. The split-based algorithm partitions the hash table range assigned to the node, on which memory is full, into two segments and assigns one of the segments to a new node in the system. The replication-based algorithm replicates the hash table range on a new node. The hybrid algorithm combines the first and second strategies in order to address each strategy's short comings. We perform an experimental performance evaluation of these algorithms on a PC cluster. Our results show that among the three algorithms, in most cases the hybrid algorithm either performs close to the better of the two or is the best algorithm.