Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes
Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, S. Kuo, R. Goh, S. Turner, W. Wong
DOI: 10.1145/2503210.2503234
The sparse matrix-vector (SpMV) multiplication routine is an important building block used in many iterative algorithms for solving scientific and engineering problems. One of the main challenges of SpMV is its memory-boundedness. Although compression has been proposed previously to improve SpMV performance on CPUs, its use has not been demonstrated on the GPU because of the serial nature of many compression and decompression schemes. In this paper, we introduce a family of bit-representation-optimized (BRO) compression schemes for representing sparse matrices on GPUs. The proposed schemes, BRO-ELL, BRO-COO, and BRO-HYB, perform compression on index data and help to speed up SpMV on GPUs through reduction of memory traffic. Furthermore, we formulate a BRO-aware matrix reordering scheme as a data clustering problem and use it to increase compression ratios. With the proposed schemes, experiments show that average speedups of 1.5× compared to ELLPACK and HYB can be achieved for SpMV on GPUs.
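The abstract does not spell out the BRO-ELL layout, but the core idea, storing index data with only as many bits as it actually needs, can be sketched briefly. The snippet below estimates the index-compression ratio for an ELLPACK-style matrix when each slice of rows shares the minimal bit width; the slice height, the banded toy matrix, and the function names are illustrative assumptions, not the paper's scheme.

```python
# Hypothetical sketch of the idea behind bit-representation-optimized (BRO)
# index compression: store ELLPACK column indices with only as many bits as
# the largest index in each row slice needs, instead of 32 bits each.
# This is an illustration, not the paper's BRO-ELL format.

def bits_needed(value):
    """Minimum number of bits required to represent a non-negative integer."""
    return max(1, value.bit_length())

def bro_compression_ratio(ell_col_indices, slice_height=32):
    """Estimate index-data compression vs. plain 32-bit ELLPACK indices.

    ell_col_indices: list of rows, each a list of padded column indices.
    slice_height: rows grouped into one slice sharing a single bit width
                  (loosely mirrors a GPU warp/thread-block granularity).
    """
    total_32bit = 0
    total_packed = 0
    for start in range(0, len(ell_col_indices), slice_height):
        rows = ell_col_indices[start:start + slice_height]
        width = max((bits_needed(c) for row in rows for c in row), default=1)
        entries = sum(len(row) for row in rows)
        total_32bit += 32 * entries
        total_packed += width * entries
    return total_32bit / total_packed if total_packed else 1.0

# Toy example: a banded matrix keeps column indices small, so few bits suffice.
cols = [[i, i + 1, i + 2] for i in range(1024)]
print(f"index compression ratio ~ {bro_compression_ratio(cols):.2f}x")
```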
Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?
Nikola Rajovic, P. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramírez, M. Valero
DOI: 10.1145/2503210.2503281
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based SoCs. This suggests that once mobile SoCs deliver sufficient performance, they could help reduce the cost of HPC. This paper addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. We also present our experience evaluating the performance and efficiency of mobile SoCs, deploying a cluster, and assessing the network and scalability of production applications. In summary, we give a first answer as to whether mobile SoCs are ready for HPC.
A scalable parallel algorithm for dynamic range-limited n-tuple computation in many-body molecular dynamics simulation
Manaschai Kunaseth, R. Kalia, A. Nakano, K. Nomura, P. Vashishta
DOI: 10.1145/2503210.2503235
Recent advancements in reactive molecular dynamics (MD) simulations based on many-body interatomic potentials necessitate efficient dynamic n-tuple computation, where a set of atomic n-tuples within a given spatial range is constructed at every time step. Here, we develop a computation-pattern algebraic framework to mathematically formulate general n-tuple computation. Based on translation/reflection-invariant properties of computation patterns within this framework, we design a shift-collapse (SC) algorithm for cell-based parallel MD. Theoretical analysis quantifies the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which reduce to those of the best pair-computation approaches (e.g., the eighth-shell method) for n = 2. Benchmark tests show that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel Xeon and BlueGene/Q clusters. SC-MD also exhibits excellent strong scalability.
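For context on the kind of computation the paper generalizes, the sketch below enumerates range-limited pairs (the n = 2 case) with a standard cell-list search. It is not the shift-collapse algorithm; the box geometry, cutoff, and function names are illustrative assumptions, and there are no periodic boundaries and no parallelism.

```python
# A standard cell-list pair (n = 2) search within a cutoff radius, shown only
# to illustrate the range-limited tuple computation that the shift-collapse
# algorithm generalizes to arbitrary n; this is not the paper's SC algorithm.
from itertools import product
from collections import defaultdict
import math

def pairs_within_cutoff(positions, box, cutoff):
    """Return index pairs (i, j), i < j, with |r_i - r_j| < cutoff (no PBC)."""
    ncell = max(1, int(box / cutoff))
    cell_size = box / ncell
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x / cell_size), int(y / cell_size), int(z / cell_size))].append(idx)

    pairs = []
    for (cx, cy, cz), members in cells.items():
        # Search the cell itself and its neighboring cells only.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            nbr = (cx + dx, cy + dy, cz + dz)
            for i in members:
                for j in cells.get(nbr, ()):
                    if j <= i:
                        continue
                    if math.dist(positions[i], positions[j]) < cutoff:
                        pairs.append((i, j))
    return pairs

atoms = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (2.5, 2.5, 2.5)]
print(pairs_within_cutoff(atoms, box=3.0, cutoff=0.5))  # -> [(0, 1)]
```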
Distributed-memory parallel algorithms for generating massive scale-free networks using preferential attachment model
M. Alam, Maleq Khan, M. Marathe
DOI: 10.1145/2503210.2503291
Recently, there has been substantial interest in the study of various random networks as mathematical models of complex systems. As these complex systems grow larger, the ability to generate progressively larger random networks becomes all the more important. This motivates the need for efficient parallel algorithms for generating such networks. Naive parallelization of the sequential algorithms for generating random networks may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges. In this paper, we present MPI-based distributed-memory parallel algorithms for generating random scale-free networks using the preferential-attachment model. Our algorithms scale very well to a large number of processors and provide almost linear speedups. The algorithms can generate scale-free networks with 50 billion edges in 123 seconds using 768 processors.
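To make the parallelization challenge concrete, here is a minimal sequential preferential-attachment generator: every new edge depends on the degree distribution produced by all earlier edges, which is exactly the dependency the paper's distributed algorithms must break. The code is a generic Barabási-Albert sketch with assumed parameter names, not the MPI algorithm from the paper.

```python
# A minimal sequential preferential-attachment (Barabasi-Albert) generator.
# Each new vertex attaches to m existing vertices chosen with probability
# proportional to their degree; the dependence of step t on all earlier edges
# is what makes naive parallelization difficult. Illustration only.
import random

def preferential_attachment(n, m, seed=0):
    """Return an edge list for a scale-free graph on n vertices (n > m >= 2)."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]  # small seed clique
    # 'targets' holds every endpoint once per incident edge, so sampling
    # uniformly from it is sampling proportional to degree.
    targets = [v for e in edges for v in e]
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for v in chosen:
            edges.append((new, v))
        targets.extend(chosen)
        targets.extend([new] * m)
    return edges

g = preferential_attachment(n=10_000, m=3)
print(len(g))  # 3 seed edges + 3 edges per added vertex
```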
Enabling comprehensive data-driven system management for large computational facilities
J. Browne, R. L. Deleon, Charng-Da Lu, Matthew D. Jones, S. Gallo, Amin Ghadersohi, A. Patra, W. Barth, John L. Hammond, T. Furlani, R. McLay
DOI: 10.1145/2503210.2503230
This paper presents a tool chain, based on the open-source tool TACC_Stats, for systematic and comprehensive job-level resource-use measurement on large cluster computers, and its incorporation into XDMoD, a reporting and analytics framework for resource management that targets the information needs of users, application developers, systems administrators, systems managers, and funding managers. Accounting, scheduler, and event logs are integrated with system performance data from TACC_Stats. TACC_Stats periodically records resource use, including many hardware counters, for each job running on each node. Furthermore, system-level metrics are obtained through aggregation of the node-level (job-level) data. Analysis of this data generates many types of standard and custom reports, and even a limited predictive capability that has not previously been available for open-source, Linux-based software systems. This paper presents case studies of information that can be applied for effective resource management. We believe this system to be the first fully comprehensive system for supporting the information needs of all stakeholders in HPC systems based on open-source software.
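As a rough illustration of the roll-up such a tool chain performs, the toy script below aggregates periodic per-node counter samples into job-level rates. The record layout, field names, and numbers are invented for the example; TACC_Stats and XDMoD use their own schemas.

```python
# A toy aggregation in the spirit of the TACC_Stats -> XDMoD pipeline described
# above: roll periodic per-node counter samples up to job-level rates.
# The record format and values here are hypothetical placeholders.
from collections import defaultdict

# (job_id, node, timestamp_s, cumulative_flops, cumulative_dram_bytes)
samples = [
    ("job42", "c001", 0,   1.0e11, 4.0e10),
    ("job42", "c001", 600, 2.2e11, 9.0e10),
    ("job42", "c002", 0,   0.9e11, 3.5e10),
    ("job42", "c002", 600, 2.0e11, 8.0e10),
]

def job_summary(samples):
    """Per-job aggregate GFLOP/s and memory bandwidth summed over its nodes."""
    per_node = defaultdict(list)
    for job, node, t, flops, dram in samples:
        per_node[(job, node)].append((t, flops, dram))
    jobs = defaultdict(lambda: {"gflops": 0.0, "gbytes_per_s": 0.0, "nodes": 0})
    for (job, node), series in per_node.items():
        series.sort()                                   # order samples by time
        dt = series[-1][0] - series[0][0] or 1          # avoid divide-by-zero
        jobs[job]["gflops"] += (series[-1][1] - series[0][1]) / dt / 1e9
        jobs[job]["gbytes_per_s"] += (series[-1][2] - series[0][2]) / dt / 1e9
        jobs[job]["nodes"] += 1
    return dict(jobs)

print(job_summary(samples))
```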
The Science DMZ: A network design pattern for data-intensive science
E. Dart, Lauren Rotman, B. Tierney, Mary Hester, J. Zurawski
DOI: 10.1145/2503210.2503245
The ever-increasing scale of scientific data has become a significant challenge for researchers who rely on networks to interact with remote computing systems and transfer results to collaborators worldwide. Despite the availability of high-capacity connections, scientists struggle with inadequate cyberinfrastructure that cripples data-transfer performance and impedes scientific progress. The Science DMZ paradigm comprises a proven set of network design patterns that collectively address these problems for scientists. We explain the Science DMZ model, including its network architecture, system configuration, cybersecurity, and performance tools, which together create an optimized network environment for science. We describe use cases from universities, supercomputing centers, and research laboratories, highlighting the effectiveness of the Science DMZ model in diverse operational settings. In all, the Science DMZ model is a solid platform that supports any science workflow and flexibly accommodates emerging network technologies. As a result, the Science DMZ vastly improves collaboration, accelerating scientific discovery.
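The Science DMZ work is architectural rather than algorithmic, but the tuning it motivates rests on simple arithmetic such as the bandwidth-delay product. The sketch below shows that calculation under an assumed link speed, RTT, and dataset size; the figures are illustrative and do not come from the paper.

```python
# Back-of-the-envelope bandwidth-delay product arithmetic of the kind used
# when tuning hosts on a high-capacity science network; illustrative only.
def tcp_buffer_bytes(link_gbps, rtt_ms):
    """Minimum TCP window (bytes) needed to keep a path full: BDP = rate * RTT."""
    return link_gbps * 1e9 / 8 * (rtt_ms / 1e3)

def transfer_time_hours(dataset_tb, achieved_gbps):
    """Wall-clock time to move a dataset at a sustained rate."""
    return dataset_tb * 8e12 / (achieved_gbps * 1e9) / 3600

# A 10 Gbps path with 50 ms RTT needs ~62.5 MB of socket buffer to stay full,
# and a 100 TB dataset still takes ~22 hours even at the full 10 Gbps.
print(f"{tcp_buffer_bytes(10, 50) / 1e6:.1f} MB")
print(f"{transfer_time_hours(100, 10):.1f} h")
```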
Feng Shui of supercomputer memory: Positional effects in DRAM and SRAM faults
Vilas Sridharan, Jon Stearley, Nathan Debardeleben, S. Blanchard, S. Gurumurthi
DOI: 10.1145/2503210.2503257
Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.
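The study is empirical, but the bookkeeping behind figures such as per-vendor fault rates and the transient/permanent split by age is straightforward. The toy script below shows that bookkeeping; the vendors, counts, and DIMM-month totals are obviously fabricated placeholders, not the study's data.

```python
# Toy computation of the kinds of rates reported above: DRAM fault rate per
# vendor and the transient share by device age. Placeholder records only.
from collections import Counter

# (vendor, age_months, kind) for each observed DRAM fault -- invented values
faults = [("A", 3, "permanent"), ("A", 20, "transient"),
          ("B", 5, "transient"), ("B", 18, "transient"), ("B", 14, "permanent")]
dimm_months = {"A": 120_000, "B": 60_000}  # in-service DIMM-months per vendor

rate = {v: sum(1 for f in faults if f[0] == v) / dimm_months[v] * 1e6
        for v in dimm_months}
print("faults per million DIMM-months:", rate)

# Bucket faults into first vs. second year of lifetime and compute the
# transient fraction in each bucket.
by_year = Counter((min(age // 12, 1), kind) for _, age, kind in faults)
for year in (0, 1):
    total = by_year[(year, "transient")] + by_year[(year, "permanent")]
    share = by_year[(year, "transient")] / total if total else 0.0
    print(f"year {year + 1}: transient share = {share:.0%}")
```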
MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters
S. Potluri, Devendar Bureddy, Khaled Hamidouche, Akshay Venkatesh, K. Kandalla, H. Subramoni, D. Panda
DOI: 10.1145/2503210.2503288
Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOPS of performance on a single chip while providing x86_64 compatibility. InfiniBand, meanwhile, is one of the most popular interconnect choices for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low-latency path for internode communication. However, drawbacks in state-of-the-art chipsets such as Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. They improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes, and improve the performance of a 3D stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
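The proxy design itself lives inside the MPI library, but the underlying pattern, relaying a large message through an intermediate buffer in chunks so the two hops overlap, can be sketched generically. The code below is that generic pipeline using Python queues and threads; it is not MVAPICH-PRISM, SCIF, or InfiniBand code, and the chunk size and queue depths are arbitrary assumptions.

```python
# Generic illustration of the proxy idea: a large message from a coprocessor
# is relayed through a host staging buffer in chunks so the two hops
# (device->host, host->network) overlap. This mimics the pipelining pattern
# only; it is not the MVAPICH-PRISM code path.
import queue
import threading

CHUNK = 1 << 20  # 1 MiB pipeline unit (arbitrary)

def sender(data, staging: queue.Queue):
    """First hop: push fixed-size chunks into the host staging buffer."""
    for off in range(0, len(data), CHUNK):
        staging.put(data[off:off + CHUNK])
    staging.put(None)  # end-of-message marker

def proxy(staging: queue.Queue, wire: queue.Queue):
    """Second hop: forward each chunk as soon as it arrives (overlap)."""
    while (chunk := staging.get()) is not None:
        wire.put(chunk)
    wire.put(None)

def receive(wire: queue.Queue) -> bytes:
    parts = []
    while (chunk := wire.get()) is not None:
        parts.append(chunk)
    return b"".join(parts)

message = bytes(8 * CHUNK)
staging, wire = queue.Queue(maxsize=4), queue.Queue(maxsize=4)  # bounded stages
threading.Thread(target=sender, args=(message, staging)).start()
threading.Thread(target=proxy, args=(staging, wire)).start()
assert receive(wire) == message
```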
Hybrid MPI: Efficient message passing for multi-core systems
A. Friedley, G. Bronevetsky, T. Hoefler, A. Lumsdaine
DOI: 10.1145/2503210.2503294
Multi-core shared-memory architectures are ubiquitous in both High-Performance Computing (HPC) and commodity systems because they provide an excellent trade-off between performance and programmability. MPI's abstraction of explicit communication across distributed memory is very popular for programming scientific applications. Unfortunately, OS-level process separations force MPI to perform unnecessary copying of messages within shared-memory nodes. This paper presents a novel approach that transparently shares memory across MPI processes executing on the same node, allowing them to communicate like threaded applications. While prior work explored thread-based MPI libraries, we demonstrate that this approach is impractical and performs poorly in practice. We instead propose a novel process-based approach that enables shared-memory communication and integrates with existing MPI libraries and applications without modifications. Our protocols for shared-memory message passing exhibit better performance and reduced cache footprint. Communication speedups of more than 26% are demonstrated for two applications.
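To illustrate the node-local idea, that processes mapping the same memory can exchange data without intermediate copies, here is a minimal sketch using Python's multiprocessing.shared_memory. It only demonstrates the shared-mapping concept; it is not the Hybrid MPI implementation, and the buffer size and process structure are assumptions.

```python
# The general idea behind node-local shared-memory message passing: two OS
# processes map the same buffer, so "sending" needs no intermediate copy.
# Python's multiprocessing.shared_memory is used purely as an illustration;
# this is not the Hybrid MPI implementation described above.
from multiprocessing import Process, shared_memory

def producer(name: str, n: int):
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:n] = bytes(range(n))   # write directly into the shared region
    shm.close()

def consumer(name: str, n: int):
    shm = shared_memory.SharedMemory(name=name)
    # Reads the very bytes the producer wrote, through the same mapping.
    assert bytes(shm.buf[:n]) == bytes(range(n))
    shm.close()

if __name__ == "__main__":
    n = 256
    shm = shared_memory.SharedMemory(create=True, size=n)
    try:
        p = Process(target=producer, args=(shm.name, n)); p.start(); p.join()
        c = Process(target=consumer, args=(shm.name, n)); c.start(); c.join()
    finally:
        shm.close()
        shm.unlink()
```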
Physics-based seismic hazard analysis on petascale heterogeneous supercomputers
Yifeng Cui, E. Poyraz, K. Olsen, Jun Zhou, K. Withers, S. Callaghan, J. Larkin, C. Guest, D. J. Choi, A. Chourasia, Zheqiang Shi, S. Day, P. Maechling, T. Jordan
DOI: 10.1145/2503210.2503300
We have developed a highly scalable and efficient GPU-based finite-difference code (AWP) for earthquake simulation that implements high throughput, memory locality, communication reduction, and communication/computation overlap, and achieves linear scalability on the Cray XK7 Titan at ORNL and NCSA's Blue Waters system. We simulate realistic 0-10 Hz earthquake ground motions relevant to building engineering design using high-performance AWP. Moreover, we show that AWP provides a speedup by a factor of 110 in key strain-tensor calculations critical to probabilistic seismic hazard analysis (PSHA). These performance improvements to critical scientific application software, coupled with improved co-scheduling capabilities of our workflow-managed systems, make a statewide hazard model a goal reachable with existing supercomputers. The performance improvements of GPU-based AWP are expected to save millions of core-hours over the next few years as physics-based seismic hazard analysis is developed using heterogeneous petascale supercomputers.
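AWP is a 3D staggered-grid velocity-stress CUDA code; as a stand-in for the class of kernel involved, the sketch below advances a toy 2D scalar wave equation with an explicit finite-difference stencil in NumPy. The grid size, time step, and source placement are arbitrary assumptions chosen to satisfy the CFL condition; this is not AWP.

```python
# A toy 2D scalar wave-equation stencil in NumPy, shown only to illustrate the
# class of finite-difference kernel that AWP implements (AWP itself is a 3D
# staggered-grid velocity-stress code running on GPUs, not this).
import numpy as np

def step(u_prev, u_curr, c, dt, dx):
    """One leapfrog update of u_tt = c^2 * (u_xx + u_yy) on the grid interior."""
    lap = (u_curr[:-2, 1:-1] + u_curr[2:, 1:-1] +
           u_curr[1:-1, :-2] + u_curr[1:-1, 2:] - 4.0 * u_curr[1:-1, 1:-1]) / dx**2
    u_next = u_curr.copy()
    u_next[1:-1, 1:-1] = (2.0 * u_curr[1:-1, 1:-1] - u_prev[1:-1, 1:-1]
                          + (c * dt) ** 2 * lap)
    return u_next

n, dx, dt, c = 256, 1.0, 0.2, 1.0          # dt chosen well inside the CFL limit
u_prev = np.zeros((n, n)); u_curr = np.zeros((n, n))
u_curr[n // 2, n // 2] = 1.0               # point source at the grid center
for _ in range(100):
    u_prev, u_curr = u_curr, step(u_prev, u_curr, c, dt, dx)
print(float(np.abs(u_curr).max()))
```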