Design and evaluation of parallel hashing over large-scale data

2014 21st International Conference on High Performance Computing (HiPC). Pub Date: 2014-12-20. DOI: 10.1109/HiPC.2014.7116909

Long Cheng, S. Kotoulas, Tomas E. Ward, G. Theodoropoulos


Citations: 8

Abstract

High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environments is the hash table. This paper investigates efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads, while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, which targets the efficient processing of massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result: the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared-memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.
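The contrast the abstract draws between key-at-a-time access and a "bulk-operation" approach can be illustrated with a minimal sketch. This is not the authors' Structured Parallel Hashing implementation, whose design the abstract does not detail. It assumes a simple scheme in which keys are first grouped by an owning partition and each partition then processes its whole batch locally. The partitions are simulated as in-process lists, and the names `owner`, `bulk_insert`, and `bulk_lookup` are hypothetical.

```python
from collections import defaultdict


def owner(key, num_partitions):
    """Assumed placement rule: the partition that owns a key (hypothetical)."""
    return hash(key) % num_partitions


def bulk_insert(pairs, num_partitions):
    """Insert (key, value) pairs in bulk rather than one key at a time.

    Phase 1 groups the pairs by owning partition, which stands in for a
    single batched exchange between nodes. Phase 2 lets every partition
    build its local hash table from its batch with no remote access.
    """
    outgoing = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        outgoing[owner(key, num_partitions)].append((key, value))

    tables = []
    for batch in outgoing:
        local = defaultdict(list)
        for key, value in batch:
            local[key].append(value)
        tables.append(dict(local))
    return tables


def bulk_lookup(tables, keys):
    """Look up many keys at once, one batch of requests per partition."""
    num_partitions = len(tables)
    requests = [[] for _ in range(num_partitions)]
    for position, key in enumerate(keys):
        requests[owner(key, num_partitions)].append((position, key))

    results = [None] * len(keys)
    for table, batch in zip(tables, requests):
        for position, key in batch:
            # Keys that were never inserted yield an empty list.
            results[position] = table.get(key, [])
    return results
```

In a real distributed setting the grouping phase would correspond to one batched transfer per destination node, so the number of messages grows with the number of nodes rather than with the number of keys. That is the intuition behind bulk operations, stated here as an assumption and not as a result taken from the paper.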



