
Latest Publications: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing
Oh-Kyoung Kwon, H. Ryu
Large-scale electronic structure simulations coupled to an empirical modeling approach are critical, as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to handle with density functional theory. For tight-binding (TB) simulations of electronic structures, which normally involve multimillion-atom systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphics processing unit (GPU) devices help save computing costs in terms of both time and energy consumption. After a short introduction to the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description of the strategies for driving performance enhancement with GPU devices relative to traditional clusters of multicore processors. While this work uses only TB electronic structure simulations for benchmark tests, it can also serve as a practical guideline for enhancing the performance of numerical operations that involve large-scale sparse matrices.
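The core numerical kernel in such sparse-matrix workloads is the sparse matrix-vector product. As a hedged illustration (not the authors' code), a minimal matrix-vector product over a matrix in compressed sparse row (CSR) form looks like the following; each output row is independent, which is precisely the parallelism a GPU exploits:

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    indptr[i]:indptr[i+1] delimits the nonzeros of row i; rows are
    independent, so a GPU can assign one thread (or warp) per row.
    """
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y

# A = [[2, 0], [1, 3]] in CSR form
y = csr_matvec([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 2.0])
```

A production TB solver would run this kernel (and the surrounding eigensolver iterations) on the device in a library such as cuSPARSE rather than in pure Python.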
DOI: 10.5772/INTECHOPEN.80997 · Published 2018-11-05
Citations: 0
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
Dingwen Tao, S. Song, S. Krishnamoorthy, Panruo Wu, Xin Liang, E. Zhang, D. Kerbyson, Zizhong Chen
Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors in general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum-updating process from the actual computation and allows adaptive control of the checksum overhead. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs can address scenarios with different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) on the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs in detecting and recovering from various types of soft errors in general iterative methods.
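The checksum idea behind ABFT matrix-vector products can be sketched as follows (a simplified stand-in for the paper's encoding, with illustrative names): precompute the column-sum vector c once; then for any product y = A x, the invariant sum(y) = c · x must hold, so a cheap dot product detects corruption in the result:

```python
import numpy as np

def column_checksum(A):
    """Encode A once: c[j] = sum_i A[i, j], so that c @ x == sum(A @ x)."""
    return A.sum(axis=0)

def checksum_ok(c, x, y, tol=1e-9):
    """Verify the invariant sum(y) == c @ x after computing y = A @ x."""
    return abs(y.sum() - c @ x) <= tol * max(1.0, abs(c @ x))

A = np.array([[4.0, 1.0], [1.0, 3.0]])
c = column_checksum(A)          # one-time encoding cost
x = np.array([1.0, 2.0])

y = A @ x
ok_clean = checksum_ok(c, x, y)  # no error injected: invariant holds

y_bad = y.copy()
y_bad[0] += 0.5                  # simulate a soft error (bit-flip-like corruption)
ok_corrupt = checksum_ok(c, x, y_bad)
```

The paper's scheme goes further (decoupled checksum updates, adaptive overhead, rollback integration), but this single-invariant check is the underlying mechanism.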
DOI: 10.1145/2907294.2907306 · Published 2016-05-31
Citations: 41
Scaling Spark on HPC Systems
Nicholas Chaimov, A. Malony, S. Canon, Costin Iancu, K. Ibrahim, Jayanth Srinivasan
We report our experiences porting Spark to large production HPC systems. While Spark performance in a data-center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it can leave single-node performance up to 4x slower than on a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance by up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling, we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.
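A file pooling layer of the kind described can be sketched as a cache of open handles, so that repeated accesses to the same file skip the open() call and its metadata round-trip to the file system servers. Class and method names here are illustrative, not Spark's:

```python
import os
import tempfile

class FilePool:
    """Cache open file handles to avoid repeated open()/close() metadata operations."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._handles = {}   # path -> open file object

    def get(self, path):
        f = self._handles.get(path)
        if f is None:
            if len(self._handles) >= self.capacity:
                _, victim = self._handles.popitem()   # simple eviction policy
                victim.close()
            f = open(path, "rb")                      # metadata op happens once
            self._handles[path] = f
        f.seek(0)            # reuse the cached handle as if freshly opened
        return f

    def close_all(self):
        for f in self._handles.values():
            f.close()
        self._handles.clear()

# Demo: two reads of the same path reuse one handle (one metadata lookup).
pool = FilePool()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"shuffle block")
first = pool.get(tmp.name)
second = pool.get(tmp.name)
reused = first is second
content = second.read()
pool.close_all()
os.unlink(tmp.name)
```

On a metadata-latency-bound system like Lustre, amortizing the open() across many reads is what yields the per-node speedup the paper reports.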
DOI: 10.1145/2907294.2907310 · Published 2016-05-31
Citations: 79
BAShuffler: Maximizing Network Bandwidth Utilization in the Shuffle of YARN
Feng Liang, F. Lau
YARN is a popular cluster resource management platform. It does not, however, manage network bandwidth, a resource that can significantly affect the execution performance of tasks that transfer large volumes of data within the cluster. The shuffle phase of MapReduce jobs features many such tasks. The impact of underutilization of network bandwidth in shuffle tasks is more pronounced when the network bandwidth capacities of the nodes in the cluster vary. We present BAShuffler, a bandwidth-aware shuffle scheduler that maximizes overall network bandwidth utilization by scheduling the source nodes of the fetch flows at the application level. BAShuffler can fully utilize the network bandwidth capacity in a max-min fair network. Experimental results for a variety of realistic benchmarks show that BAShuffler can substantially improve the cluster's shuffle throughput and reduce the execution time of shuffle tasks compared to the original YARN, especially in heterogeneous network bandwidth environments.
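Max-min fairness, the allocation model BAShuffler targets, can be computed by progressive filling: repeatedly split the unallocated capacity evenly among still-unsatisfied flows, capping each flow at its demand. A hedged sketch (the function name is illustrative):

```python
def max_min_fair(capacity, demands, eps=1e-9):
    """Progressive-filling allocation of `capacity` across flow `demands`.

    Each round gives every unsatisfied flow an equal share of what is left;
    flows whose demand is met drop out, and their leftover capacity is
    redistributed in the next round.
    """
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > eps]
    remaining = capacity
    while active and remaining > eps:
        share = remaining / len(active)
        still_unsatisfied = []
        for i in active:
            take = min(share, demands[i] - alloc[i])
            alloc[i] += take
            remaining -= take
            if demands[i] - alloc[i] > eps:
                still_unsatisfied.append(i)
        active = still_unsatisfied
    return alloc

# A saturated 10-unit link shared by flows demanding 2, 4, and 8 units.
alloc = max_min_fair(10.0, [2.0, 4.0, 8.0])
```

The small flow gets its full 2 units, and the remaining 8 units are split evenly between the two larger flows, which is the max-min fair outcome.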
DOI: 10.1145/2907294.2907296 · Published 2016-05-31
Citations: 8
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
H. Nakashima, K. Taura, Jack Lange
Welcome to the 25th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'16). HPDC'16 follows the tradition of previous editions of the conference by providing a high-quality, single-track forum for presenting new research results on all aspects of the design, implementation, evaluation, and application of parallel and distributed systems for high-end computing. The HPDC'16 program features eight sessions that cover a wide range of topics including high-performance networking, parallel algorithms, algorithm-based fault tolerance, big data processing, I/O optimizations, non-volatile memory, cloud computing, resource management, many-core systems, GPUs, graph processing algorithms, and more. In these sessions, both full papers and short papers are presented, offering a mix of novel research directions at various stages of development, which is also exhibited by a number of posters. The program is complemented by an interesting set of six workshops, FTXS, HPGP, SEM4HPC, DIDC, ROSS and ScienceCloud, on a range of timely and related systems and application topics. The conference program also features three keynote/invited talks, given by Dr. Jeffrey Vetter of Oak Ridge National Laboratory, Professor Jack Dongarra of the University of Tennessee, and Professor Ada Gavrilovska of Georgia Tech in memory of the late Professor Karsten Schwan of Georgia Tech. Jack Dongarra is the recipient of the 5th HPDC Annual Achievement Award. The purpose of this award is to recognize individuals who have made long-lasting, influential contributions to the foundations or practice of the field of high-performance parallel and distributed computing, to raise awareness of these contributions, especially among the younger generation of researchers, and to improve the image and public relations of the HPDC community. The Award Selection Committee followed the formalized process established in 2013 to select the winner through an open call for nominations.
The HPDC'16 call for papers attracted 129 paper submissions. In this year's review process, we followed two methods established in 2012: a two-round review process and an author rebuttal process. In the first round, all papers received at least three reviews; based on these reviews, 71 papers went on to the second round, in which most received another two reviews. In total, 514 reviews were produced by the 54-member Program Committee along with a number of external reviewers. For many of the 71 second-round papers, the authors submitted rebuttals, which were carefully taken into consideration during the Program Committee deliberations as part of the selection process. On March 10-11, the Program Committee met at the University of Pittsburgh (Pittsburgh, PA) and made the final selection; each paper in the second round of reviews was discussed at the meeting. At the end of the 1.5-day meeting, the Program Committee accepted 20 full papers, resulting in an acceptance rate of 15.5%. The committee additionally accepted nine submissions as short papers. We thank all authors for their submissions, regardless of the outcome, and we are deeply grateful to the Program Committee members for their hard work in delivering reviews under a very tight schedule and a rigorous review process.
DOI: 10.1145/2907294 · Published 2016-05-31
Citations: 0
Network-Managed Virtual Global Address Space for Message-driven Runtimes
Abhishek Kulkarni, Luke Dalessandro, E. Kissel, A. Lumsdaine, T. Sterling, M. Swany
Maintaining a scalable high-performance virtual global address space on distributed memory hardware has proven challenging. In this paper we evaluate a new approach to such an active global address space that leverages the capabilities of the network fabric, rather than software at the endpoint hosts, to manage addressing. We describe our overall approach and design alternatives, and present initial experimental results that demonstrate the effectiveness and limitations of existing network hardware.
DOI: 10.1145/2907294.2907320 · Published 2016-05-31
Citations: 5
High-Performance Distributed RMA Locks
P. Schmid, Maciej Besta, T. Hoefler
We propose a topology-aware distributed reader-writer lock that accelerates irregular workloads on supercomputers and in data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three-dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves on state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for the highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.
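As a hedged, shared-memory analogue of the lock's reader/writer counter component (the paper's version keeps this state in remote memory and manipulates it with RMA operations; this single-node sketch uses threads instead), the basic counter logic looks like:

```python
import threading

class RWCounterLock:
    """Reader-writer lock built on a reader counter and a writer flag.

    Readers proceed concurrently; a writer waits until the counter drains
    and then holds exclusive access. This mirrors, in shared memory, the
    counter structure a distributed RMA lock keeps on a remote node.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

# Demo: four writer threads serialize on the lock; readers could overlap freely.
lock = RWCounterLock()
value = 0

def writer():
    global value
    for _ in range(1000):
        lock.acquire_write()
        value += 1
        lock.release_write()

threads = [threading.Thread(target=writer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The paper's contribution lies in distributing this state (counter, writer queues, synchronization tree) across nodes and tuning the reader/writer bias parameters, which a shared-memory sketch cannot capture.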
DOI: 10.1145/2907294.2907323 · Published 2016-05-31
Citations: 31
Efficient Processing of Large Graphs via Input Reduction
Amlan Kusum, Keval Vora, Rajiv Gupta, Iulian Neamtiu
Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two-phase processing model, which runs the original iterative algorithm in two phases: first, using the reduced input graph to gain savings in execution time; and second, using the original input graph along with the results from the first phase to compute precise results. We propose several input reduction transformations and identify the structural and non-structural properties they guarantee, which in turn are used to ensure the correctness of results under our two-phase processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).
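The two-phase model can be sketched with PageRank (an illustrative mini-example, not the paper's implementation): phase one iterates on a reduced edge list, and phase two runs on the full graph warm-started from the phase-one ranks, so it typically needs only a few refinement iterations to reach the exact fixed point:

```python
def pagerank(edges, n, init=None, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration over an edge list; returns (ranks, iterations used)."""
    rank = list(init) if init is not None else [1.0 / n] * n
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    for it in range(1, max_iter + 1):
        nxt = [(1 - d) / n] * n
        for u, v in edges:
            nxt[v] += d * rank[u] / out_deg[u]
        if sum(abs(a - b) for a, b in zip(nxt, rank)) < tol:
            return nxt, it
        rank = nxt
    return rank, max_iter

full = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
reduced = [e for e in full if e != (2, 3)]   # illustrative input reduction: drop one edge

# Phase 1: cheap approximate ranks on the reduced graph.
approx, _ = pagerank(reduced, 4)
# Phase 2: exact ranks on the full graph, warm-started from phase 1.
warm, warm_iters = pagerank(full, 4, init=approx)
# Baseline for comparison: exact ranks from a cold (uniform) start.
cold, cold_iters = pagerank(full, 4)
```

Because PageRank has a unique fixed point, the warm-started phase-two run converges to the same ranks as the cold start; the savings come from phase one being much cheaper per iteration on the reduced graph.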
DOI: 10.1145/2907294.2907312 · Published 2016-05-31
Citations: 27
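The two-phase model described in the abstract above can be illustrated with a small, hypothetical sketch: PageRank is first run to convergence on a reduced graph, and the resulting ranks warm-start the computation on the original graph. Note that the `reduce_graph` step here (keeping only vertices of out-degree at least two) is a stand-in chosen for brevity, not one of the transformations the paper actually proposes, and `two_phase_pagerank` is an invented name for this sketch.

```python
# Toy adjacency-list graphs: a dict mapping each vertex to a list of
# out-neighbours. This is an illustrative sketch, not the paper's code.

def pagerank(graph, ranks=None, damping=0.85, iters=50):
    """Plain power-iteration PageRank; dangling vertices spread their
    rank uniformly so the total rank mass stays at 1."""
    n = len(graph)
    if ranks is None:
        ranks = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = damping * ranks[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling vertex: distribute its rank uniformly
                for u in graph:
                    new[u] += damping * ranks[v] / n
        ranks = new
    return ranks

def reduce_graph(graph):
    """Illustrative (NOT the paper's) reduction: keep only vertices
    with out-degree >= 2, dropping edges to removed vertices."""
    keep = {v for v, outs in graph.items() if len(outs) >= 2}
    return {v: [u for u in graph[v] if u in keep] for v in keep}

def two_phase_pagerank(graph, phase2_iters=10):
    # Phase 1: converge on the reduced graph (cheap).
    small = reduce_graph(graph)
    seed = pagerank(small) if small else {}
    # Phase 2: warm-start the original graph from the phase-1 ranks,
    # normalised so the total rank mass is 1 again.
    n = len(graph)
    init = {v: seed.get(v, 1.0 / n) for v in graph}
    total = sum(init.values())
    init = {v: r / total for v, r in init.items()}
    return pagerank(graph, ranks=init, iters=phase2_iters)
```

The intended payoff is that phase 2 needs far fewer iterations than a cold start would, since the reduced-graph ranks are already close to the fixed point.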
Session details: Parallel and Fault Tolerant algorithms
A. Butt
DOI: 10.1145/3257970
Citations: 0
Evaluation of Pattern Matching Workloads in Graph Analysis Systems
Seokyong Hong, S. Lee, Seung-Hwan Lim, S. Sukumar, Ranga Raju Vatsavai
Graph data management and mining have become a popular area of research, leading to the development of a plethora of systems in recent years. Unfortunately, many emerging graph analysis systems assume different graph data models and support different query interfaces and serialization formats. Such diversity, combined with a lack of comparisons, makes it complicated to understand the trade-offs between different systems and the graph operations for which they are designed. This study presents an evaluation of the graph pattern matching capabilities of six graph analysis systems, extending the Lehigh University Benchmark to investigate how effectively the same operation can be performed over the same graph in each system. Through this evaluation, the study reveals both quantitative and qualitative findings.
DOI: 10.1145/2907294.2907305
Citations: 4
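As a concrete, purely illustrative example of the workload class this study benchmarks, the sketch below matches a small directed pattern graph against a data graph by naive backtracking. The systems evaluated in the paper use far more sophisticated engines and query languages; this only shows what "graph pattern matching" means operationally, and `match_pattern` is an invented name for the sketch.

```python
# Graphs are dicts mapping every vertex to its set of out-neighbours.
# A match is an injective mapping pattern-vertex -> data-vertex that
# preserves every directed pattern edge (non-induced subgraph match).

def match_pattern(data, pattern):
    """Return all matches of `pattern` in `data` via naive backtracking."""
    pvs = list(pattern)  # fixed assignment order for pattern vertices
    results = []

    def extend(assign):
        if len(assign) == len(pvs):
            results.append(dict(assign))
            return
        pv = pvs[len(assign)]
        for dv in data:
            if dv in assign.values():
                continue  # keep the mapping injective
            assign[pv] = dv
            # every pattern edge between already-assigned vertices
            # must exist in the data graph
            ok = all(assign[q] in data[assign[p]]
                     for p in assign for q in pattern[p] if q in assign)
            if ok:
                extend(assign)
            del assign[pv]

    extend({})
    return results
```

For example, matching the single-edge pattern `{'a': {'b'}, 'b': set()}` against a 3-cycle yields one match per edge, and matching a triangle pattern yields its three rotations.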
Journal
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing