One of the most fascinating issues in modern condensed matter physics is to understand strongly-correlated electronic structures and to propose novel device designs based on them for a future of reduced carbon-dioxide emissions. Among the numerical approaches developed for strongly-correlated electrons, the density matrix renormalization group (DMRG) is widely accepted as the most promising scheme, compared to Monte Carlo and exact diagonalization, in terms of accuracy and accessible system size. In fact, DMRG resolves long one-dimensional chain-like quantum systems almost perfectly. In this paper, we suggest an extension of the approach toward higher-dimensional systems using high-performance computing techniques. The computational core of DMRG is the diagonalization of a huge, non-uniform sparse matrix. To parallelize this part efficiently, we implement communication step doubling together with reuse of the mid-point data between the two doubled steps, avoiding the severe bottleneck of the all-to-all communication that the diagonalization requires. The technique remains effective on clusters with more than 1000 cores and offers a reliable way to explore two-dimensional strongly-correlated systems.
{"title":"Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D Quantum strongly-correlated systems","authors":"S. Yamada, Toshiyuki Imamura, M. Machida","doi":"10.1145/2063384.2063467","DOIUrl":"https://doi.org/10.1145/2063384.2063467","url":null,"abstract":"One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and propose their novel device designs toward the reduced carbon-dioxide future. Among various developed numerical approaches for highly-correlated electrons, the density matrix renormalization group (DMRG) has been widely accepted as the most promising numerical scheme compared to Monte Carlo and exact diagonalization in terms of accuracy and accessible system size. In fact, DMRG almost perfectly resolves one-dimensional chain like long quantum systems. In this paper, we suggest its extended approach toward higher-dimensional systems by high-performance computing techniques. The computing target in DMRG is a huge non-uniform sparse matrix diagonalization. In order to efficiently parallelize the part, we implement communication step doubling together with reuse of the mid-point data between the doubled two steps to avoid severe bottleneck of all-to-all communications essential for the diagonalization. The technique is successful even for clusters composed of more than 1000 cores and offers a trustworthy exploration way for two-dimensional highly-correlated systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"265 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123107271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang
Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, even though concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time, reducing completion time while maintaining server utilization and fairness. A window-wide coordination concept is introduced for this purpose. We present the proposed I/O coordination algorithm together with an analysis of its average completion time. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46% and provide higher I/O bandwidth than the default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.
{"title":"Server-side I/O coordination for parallel file systems","authors":"Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang","doi":"10.1145/2063384.2063407","DOIUrl":"https://doi.org/10.1145/2063384.2063407","url":null,"abstract":"Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, whereas concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time in order to reduce the completion time, and in the meantime maintain the server utilization and fairness. A window-wide coordination concept is introduced to serve our purpose. We present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116590747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, by up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using these, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
{"title":"Improving communication performance in dense linear algebra via topology aware collectives","authors":"Edgar Solomonik, A. Bhatele, J. Demmel","doi":"10.1145/2063384.2063487","DOIUrl":"https://doi.org/10.1145/2063384.2063487","url":null,"abstract":"Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129079990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead of the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement over the original on a state-of-the-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, which is, to the best of our knowledge, the largest scale-free graph analyzed by any in-memory parallel eigensolver.
{"title":"A scalable eigensolver for large scale-free graphs using 2D graph partitioning","authors":"A. Yoo, A. Baker, R. Pearce, V. Henson","doi":"10.1145/2063384.2063469","DOIUrl":"https://doi.org/10.1145/2063384.2063469","url":null,"abstract":"Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead in the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement compared to the original on a state-of-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, the largest scale-free graph analyzed by any in-memory parallel eigensolver, to the best of our knowledge.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) of accommodating multiple concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism for accommodating a large number of requests iteratively. Finally, we show via numerical results that RRA constructs schedules that accommodate a significantly larger number of requests than other, seemingly efficient, heuristics.
{"title":"End-to-end network QoS via scheduling of flexible resource reservation requests","authors":"Sushant Sharma, D. Katramatos, Dantong Yu","doi":"10.1145/2063384.2063475","DOIUrl":"https://doi.org/10.1145/2063384.2063475","url":null,"abstract":"Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) to accommodate multiple and concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then, we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism to accommodate large number of requests in an iterative manner. Finally, we show via numerical results that RRA constructs schedules that accommodate significantly larger number of requests compared to other, seemingly efficient, heuristics.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124710782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka
The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting the patterns formed in solidified metals is indispensable. The phase-field method is the most powerful approach known for simulating micro-scale dendritic growth during solidification in a binary alloy. To describe solidification realistically, however, a phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to this heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only in describing simple shapes. Our new simulation techniques achieve unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology demonstrate good weak scaling and achieve 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.
{"title":"Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer","authors":"T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka","doi":"10.1145/2063384.2063388","DOIUrl":"https://doi.org/10.1145/2063384.2063388","url":null,"abstract":"The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting patterns in solidified metals would be indispensable. The phase-field simulation is the most powerful method known to simulate the micro-scale dendritic growth during solidification in a binary alloy. To evaluate the realistic description of solidification, however, phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to such heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods was successful only in describing simple shapes. Our new simulation techniques achieved scales unprecedentedly large, sufficient for handling complex dendritic structures required in material science. Our simulations on the GPU-rich TSUBAME 2.0 super- computer at the Tokyo Institute of Technology have demonstrated good weak scaling and achieved 1.017 PFlops in single precision for our largest configuration, using 4,000 CPUs along with 16,000 CPU cores.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129498360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.
{"title":"BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2063384.2063429","DOIUrl":"https://doi.org/10.1145/2063384.2063429","url":null,"abstract":"Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115786039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges both in collecting these files and in archiving them in a consistent manner. It has become commonplace to develop a custom utility for each system, tailored specifically to that system. In computer centers that contain multiple systems, each system then has its own utility for gathering and archiving log files. Each time a new log file is produced, the utility must be modified; each modification risks introducing errors and takes time to make. Eliminating this per-system effort is precisely the purpose of logjam. Once installed, the code only requires modification when new features are needed. A configuration file identifies each log file, where to harvest it, and how to archive it. Adding a new log file is as simple as defining it in the configuration file, and testing can be performed in the production environment.
{"title":"Logjam: A scalable unified log file archiver","authors":"N. Cardo","doi":"10.1145/2063348.2063379","DOIUrl":"https://doi.org/10.1145/2063348.2063379","url":null,"abstract":"Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges in ways to obtain these files as well as archiving them in a consistent manner. It has become commonplace to develop a custom written utility for each system that is tailored specifically to that system. For computer centers that contain multiple systems, each system would have their own respective utility for gathering and archiving log files. Each time a new log file is produced, a modification to the utility is necessary. With each modification, risk of errors could be introduced as well as spending time to introduce that change. This is precisely the purpose of logjam. Once installed, the code only requires modification when new features are required. A configuration file is used to identify each log file as well as where to harvest it and how to archive it. Adding a new log file is as simple as defining it in a configuration file and testing can be performed in the production environment.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121358234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Mei, Yanhua Sun, G. Zheng, Eric J. Bohm, L. Kalé, James C. Phillips, Christopher B. Harrison
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, managing the large memory footprint, and obtaining good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the Charm++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (relative to 6,720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
{"title":"Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime","authors":"Chao Mei, Yanhua Sun, G. Zheng, Eric J. Bohm, L. Kalé, James C. Phillips, Christopher B. Harrison","doi":"10.1145/2063384.2063466","DOIUrl":"https://doi.org/10.1145/2063384.2063466","url":null,"abstract":"A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the Charm++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129137709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture, comprising multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version in which the CPU part is parallelized using OpenMP and the GPU part via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single-node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.
{"title":"Scalable fast multipole methods on distributed heterogeneous architectures","authors":"Qi Hu, N. Gumerov, R. Duraiswami","doi":"10.1145/2063384.2063432","DOIUrl":"https://doi.org/10.1145/2063384.2063432","url":null,"abstract":"We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}