CooMR: Cross-task coordination for efficient data management in MapReduce programs
Xiaobing Li, Yandong Wang, Yizheng Jiao, Cong Xu, Weikuan Yu
SC '13. DOI: 10.1145/2503210.2503276

Hadoop is a widely adopted open-source implementation of the MapReduce programming model for big data processing. It represents system resources as available map and reduce slots and assigns them to various tasks. This execution model gives little regard to the need for cross-task coordination in the use of shared system resources on a compute node, which results in task interference. In addition, the existing Hadoop merge algorithm can cause excessive I/O. In this study, we undertake an effort to address both issues. Accordingly, we have designed a cross-task coordination framework called CooMR for efficient data management in MapReduce programs. CooMR consists of three component schemes: cross-task opportunistic memory sharing and log-structured I/O consolidation, which are designed to facilitate task coordination, and the key-based in-situ merge (KISM) algorithm, which is designed to enable the sorting/merging of Hadoop intermediate data without actually moving the <key, value> pairs. Our evaluation demonstrates that CooMR is able to increase task coordination, improve system resource utilization, and significantly speed up the execution of MapReduce programs.
Investigating applications portability with the Uintah DAG-based runtime system on petascale supercomputers
Qingyu Meng, A. Humphrey, John A. Schmidt, M. Berzins
SC '13. DOI: 10.1145/2503210.2503250

Current trends in high performance computing present formidable challenges for applications code using multicore nodes, possibly with accelerators and/or co-processors and reduced memory, while still attaining scalability. Software frameworks that execute machine-independent applications code using a runtime system that shields users from architectural complexities offer a possible solution. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. Uintah executes directed acyclic graphs of computational tasks with a scalable, asynchronous, and dynamic runtime system for the CPU cores and/or accelerators/co-processors on a node. Uintah's clear separation between application and runtime code has led to scalability increases of 1000x without significant changes to application code. This methodology is tested on three leading Top500 machines: OLCF Titan, TACC Stampede, and ALCF Mira, using three diverse and challenging application problems. This investigation of scalability with regard to the different processors and communication performance leads to the overall conclusion that the adaptive DAG-based approach provides a very powerful abstraction for solving challenging multi-scale, multi-physics engineering problems on some of the largest and most powerful computers available today.
20 Petaflops simulation of proteins suspensions in crowding conditions
M. Bernaschi, M. Bisson, M. Fatica, S. Melchionna
SC '13. DOI: 10.1145/2503210.2504563

We present performance results for the simulation of protein suspensions in crowding conditions obtained with MUPHY, a computational platform for multi-scale simulations of real-life biofluidic problems. Previous versions of MUPHY have been used in the past for the simulation of blood flow through the human coronary arteries and DNA translocation across nanopores. The simulation exhibits excellent scalability up to 18,000 K20X Nvidia GPUs and achieves almost 20 Petaflops of aggregate sustained performance, with a peak performance of 27.5 Petaflops for the most intensive computing component. Those figures demonstrate once again the flexibility of MUPHY in simulating biofluidic phenomena, exploiting at their best the features of the architecture in use. Preliminary results were also obtained on a completely different platform, the IBM Blue Gene/Q. The combination of novel mathematical models, computational algorithms, hardware technology, code tuning, and parallelization techniques required to achieve these results is presented.
Globalizing selectively: Shared-memory efficiency with address-space separation
N. Mahajan, Uday Pitambare, A. Chauhan
SC '13. DOI: 10.1145/2503210.2503275

It has become common for MPI-based applications to run on shared-memory machines. However, MPI semantics do not allow the MPI library to fully leverage shared memory for communication between processes. This paper presents an approach that combines compiler transformations with a specialized runtime system to achieve zero-copy communication whenever possible: certain properties are proved statically, and data is globalized selectively by altering the allocation and deallocation of communication buffers. When such proofs are not possible statically, the runtime system provides dynamic optimization by copying data only when there are write-write or read-write conflicts. We implemented a prototype compiler, using ROSE, and evaluated it on several benchmarks. Our system produces code that performs better than MPI in most cases and, in all cases, no worse than MPI tuned for shared memory.
On fast parallel detection of strongly connected components (SCC) in small-world graphs
Sungpack Hong, Nicole C. Rodia, K. Olukotun
SC '13. DOI: 10.1145/2503210.2503246

Detecting strongly connected components (SCCs) in a directed graph is a fundamental graph analysis algorithm that is used in many science and engineering domains. Traditional approaches in parallel SCC detection, however, show limited performance and poor scaling behavior when applied to large real-world graph instances. In this paper, we investigate the shortcomings of the conventional approach and propose a series of extensions that consider the fundamental properties of real-world graphs, e.g. the small-world property. Our scalable implementation offers excellent performance on diverse, small-world graphs resulting in a 5.01× to 29.41× parallel speedup over the optimal sequential algorithm with 16 cores and 32 hardware threads.
Characterization and modeling of PIDX parallel I/O for performance optimization
Sidharth Kumar, A. Saha, V. Vishwanath, P. Carns, John A. Schmidt, G. Scorzelli, H. Kolla, R. Grout, R. Latham, R. Ross, M. Papka, Jacqueline H. Chen, Valerio Pascucci
SC '13. DOI: 10.1145/2503210.2503252

Parallel I/O library performance can vary greatly in response to user-tunable parameter values such as aggregator count, file count, and aggregation strategy. Unfortunately, manual selection of these values is time consuming and dependent on characteristics of the target machine, the underlying file system, and the dataset itself. Some characteristics, such as the amount of memory per core, can also impose hard constraints on the range of viable parameter values. In this work we address these problems by using machine learning techniques to model the performance of the PIDX parallel I/O library and select appropriate tunable parameter values. We characterize both the network and I/O phases of PIDX on a Cray XE6 as well as an IBM Blue Gene/P system. We use the results of this study to develop a machine learning model for parameter space exploration and performance prediction.
ACIC: Automatic cloud I/O configurator for HPC applications
Mingliang Liu, Ye Jin, Jidong Zhai, Yan Zhai, Qianqian Shi, Xiaosong Ma, Wenguang Chen
SC '13. DOI: 10.1145/2503210.2503216

The cloud has become a promising alternative to traditional HPC centers or in-house clusters. This new environment highlights the I/O bottleneck problem: cloud platforms typically offer top-of-the-line compute instances but sub-par communication and I/O facilities. It has been observed that changing cloud I/O system configurations leads to significant variation in the performance and cost efficiency of I/O intensive HPC applications. However, storage system configuration is tedious and error-prone to do manually, even for experts. This paper proposes ACIC, which takes a given application running on a given cloud platform and automatically searches for optimized I/O system configurations. ACIC utilizes machine learning models to perform black-box performance/cost predictions. To tackle the high-dimensional parameter exploration space unique to cloud platforms, we enable affordable, reusable, and incremental training guided by Plackett-Burman matrices. Results with four representative applications indicate that ACIC consistently identifies near-optimal configurations among a large group of candidate settings.
Enabling fair pricing on HPC systems with node sharing
Alex D. Breslow, Ananta Tiwari, M. Schulz, L. Carrington, Lingjia Tang, Jason Mars
SC '13. DOI: 10.1145/2503210.2503256

Co-location, where multiple jobs share compute nodes in large-scale HPC systems, has been shown to increase aggregate throughput and energy efficiency by 10 to 20%. However, system operators disallow co-location due to fair-pricing concerns, i.e., the lack of a pricing mechanism that accounts for performance interference from co-running jobs. In the current pricing model, application execution time determines the price, which results in unfair prices paid by the minority of users whose jobs suffer from co-location. This paper presents POPPA, a runtime system that enables fair pricing by delivering precise online interference detection, thereby facilitating the adoption of co-location on supercomputers. POPPA leverages a novel shutter mechanism, a cyclic, fine-grained interference sampling mechanism that accurately deduces the interference between co-runners, to provide unbiased pricing of jobs that share nodes. POPPA is able to quantify inter-application interference within 4% mean absolute error on a variety of co-located benchmark and real scientific workloads.
Radiative signature of the relativistic Kelvin-Helmholtz Instability
M. Bussmann, H. Burau, T. Cowan, A. Debus, A. Huebl, G. Juckeland, T. Kluge, W. Nagel, R. Pausch, Felix Schmitt, U. Schramm, Joseph Schuchart, R. Widera
SC '13. DOI: 10.1145/2503210.2504564

We present a particle-in-cell simulation of the relativistic Kelvin-Helmholtz Instability (KHI) that for the first time delivers angularly resolved radiation spectra of the particle dynamics during the formation of the KHI. This enables studying the formation of the KHI with unprecedented spatial, angular, and spectral resolution. Our results are of great importance for understanding astrophysical jet formation and comparable plasma phenomena, as they relate the particle motion observed in the KHI to its radiation signature. The innovative methods presented here for implementing the particle-in-cell algorithm on graphics processing units can be directly adapted to any many-core parallelization of the particle-mesh method. With these methods we see a peak performance of 7.176 PFLOP/s (double precision) plus 1.449 PFLOP/s (single precision), an efficiency of 96% when weakly scaling from 1 to 18432 nodes, and an efficiency of 68.92% with a speedup of 794 (ideal: 1152) when strongly scaling from 16 to 18432 nodes.
HACC: Extreme scaling and performance across diverse architectures
S. Habib, V. Morozov, N. Frontiere, H. Finkel, A. Pope, K. Heitmann
SC '13. DOI: 10.1145/2503210.2504566

Supercomputing is evolving towards hybrid and accelerator-based architectures with millions of cores. The HACC (Hardware/Hybrid Accelerated Cosmology Code) framework exploits this diverse landscape at the largest scales of problem size, obtaining high scalability and sustained performance. Developed to satisfy the science requirements of cosmological surveys, HACC melds particle and grid methods using a novel algorithmic structure that flexibly maps across architectures, including CPU/GPU, multi/many-core, and Blue Gene systems. We demonstrate the success of HACC on two very different machines, the CPU/GPU system Titan and the BG/Q systems Sequoia and Mira, attaining unprecedented levels of scalable performance. We demonstrate strong and weak scaling on Titan, obtaining up to 99.2% parallel efficiency, evolving 1.1 trillion particles. On Sequoia, we reach 13.94 PFlops (69.2% of peak) and 90% parallel efficiency on 1,572,864 cores, with 3.6 trillion particles, the largest cosmological benchmark yet performed. HACC design concepts are applicable to several other supercomputer applications.