Waiting Game: Optimally Provisioning Fixed Resources for Cloud-Enabled Schedulers
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00071
Pradeep Ambati, Noman Bashir, D. Irwin, P. Shenoy
While cloud platforms enable users to rent computing resources on demand to execute their jobs, buying fixed resources is still much cheaper than renting if their utilization is high. Thus, optimizing cloud costs requires users to determine how many fixed resources to buy versus rent based on their workload. In this paper, we introduce the concept of a waiting policy for cloud-enabled schedulers, which is the dual of a scheduling policy, and show that the optimal cost depends on it. We define multiple waiting policies and develop simple analytical models to reveal the tradeoffs among fixed resource provisioning, cost, and job waiting time. We evaluate the impact of these waiting policies on a year-long production batch workload consisting of 14M jobs run on a 14.3k-core cluster, and show that a compound waiting policy decreases the cost (by 5%) and mean job waiting time (by 7×) compared to a fixed cluster of the current size.
{"title":"Waiting Game: Optimally Provisioning Fixed Resources for Cloud-Enabled Schedulers","authors":"Pradeep Ambati, Noman Bashir, D. Irwin, P. Shenoy","doi":"10.1109/SC41405.2020.00071","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00071","url":null,"abstract":"While cloud platforms enable users to rent computing resources on demand to execute their jobs, buying fixed resources is still much cheaper than renting if their utilization is high. Thus, optimizing cloud costs requires users to determine how many fixed resources to buy versus rent based on their workload. In this paper, we introduce the concept of a waiting policy for cloud-enabled schedulers, which is the dual of a scheduling policy, and show that the optimal cost depends on it. We define multiple waiting policies and develop simple analytical models to reveal their tradeoff between fixed resource provisioning, cost, and job waiting time. We evaluate the impact of these waiting policies on a year-long production batch workload consisting of 14Mjobs run on a 14.3k-core cluster, and show that a compound waiting policy decreases the cost (by 5%) and mean job waiting time (by 7×) compared to a fixed cluster of the current size.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"355 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132794934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INEC: Fast and Coherent In-Network Erasure Coding
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00070
Haiyang Shi, Xiaoyi Lu
Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of coherent EC calculation and networking on modern SmartNICs indicates that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives enable different kinds of EC schemes to fully leverage the EC offload capability of modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.
{"title":"INEC: Fast and Coherent In-Network Erasure Coding","authors":"Haiyang Shi, Xiaoyi Lu","doi":"10.1109/SC41405.2020.00070","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00070","url":null,"abstract":"Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130978045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compiling Generalized Histograms for GPU
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00101
Troels Henriksen, Sune Hellfritzsch, P. Sadayappan, C. Oancea
We present and evaluate an implementation technique for histogram-like computations on GPUs that ensures work-efficient asymptotic cost, support for arbitrary associative and commutative operators, and efficient use of hardware-supported atomic operations when applicable. Based on a systematic empirical examination of the design space, we develop a technique that balances conflict rates and memory footprint. We demonstrate our technique both as a library implementation in CUDA and by extending the parallel array language Futhark with a new construct for expressing generalized histograms, supporting this construct with several compiler optimizations. We show that our histogram implementation taken in isolation outperforms similar primitives from CUB, and that it is competitive with or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.
{"title":"Compiling Generalized Histograms for GPU","authors":"Troels Henriksen, Sune Hellfritzsch, P. Sadayappan, C. Oancea","doi":"10.1109/SC41405.2020.00101","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00101","url":null,"abstract":"We present and evaluate an implementation technique for histogram-like computations on GPUs that ensures both work-efficient asymptotic cost, support for arbitrary associative and commutative operators, and efficient use of hardwaresupported atomic operations when applicable. Based on a systematic empirical examination of the design space, we develop a technique that balances conflict rates and memory footprint. We demonstrate our technique both as a library implementation in CUDA, as well as by extending the parallel array language Futhark with a new construct for expressing generalized histograms, and by supporting this construct with several compiler optimizations. We show that our histogram implementation taken in isolation outperforms similar primitives from CUB, and that it is competitive or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125128638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Realization of Numerical Towing-Tank Tests by Wall-Resolved Large Eddy Simulation based on 32 Billion Grid Finite-Element Computation
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00007
C. Kato, Y. Yamade, K. Nagano, Kiyoshi Kumahata, K. Minami, Tatsuo Nishikawa
To realize numerical towing-tank tests by substantially shortening the time to solution, a general-purpose finite-element flow solver, named FrontFlow/blue (FFB), has been fully optimized to achieve the maximum possible sustained memory throughput in three of its four hot kernels. A single-node sustained performance of 179.0 GFLOPS, which corresponds to 5.3% of the peak performance, has been achieved on Fugaku, the next flagship computer of Japan. A weak-scaling benchmark test has confirmed that FFB runs with a parallel efficiency of over 85% on up to 5,505,024 compute cores, and an overall sustained performance of 16.7 PFLOPS has been achieved. As a result, the time needed for a large-eddy simulation using 32 billion grid points has been reduced from almost two days to only 37 minutes, i.e., by a factor of 71. This clearly indicates that a numerical towing tank could actually be built for ship hydrodynamics within a few years.
{"title":"Toward Realization of Numerical Towing-Tank Tests by Wall-Resolved Large Eddy Simulation based on 32 Billion Grid Finite-Element Computation","authors":"C. Kato, Y. Yamade, K. Nagano, Kiyoshi Kumahata, K. Minami, Tatsuo Nishikawa","doi":"10.1109/SC41405.2020.00007","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00007","url":null,"abstract":"To realize numerical towing-tank tests by substantially shortening the time to the solution, a general-purpose Finite-Element flow solver, named FrontFlow/blue (FFB), has been fully optimized so as to achieve maximum possible sustained memory throughputs with three of its four hot kernels. A single-node sustained performance of 179.0 GFLOPS, which corresponds to 5.3% of the peak performance, has been achieved on Fugaku, the next flagship computer of Japan. A weak-scale benchmark test has confirmed that FFB runs with a parallel efficiency of over 85% up to 5,505,024 compute cores, and an overall sustained performance of16.7 PFLOPS has been achieved. As a result, the time needed for large-eddy simulation using 32 billion grids has been significantly reduced from almost two days to only 37 min., or by a factor of 71. This has clearly indicated that a numerical towing-tank could actually be built for ship hydrodynamics within a few years.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114549674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OMPRacer: A Scalable and Precise Static Race Detector for OpenMP Programs
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00058
Bradley Swain, Yanze Li, Peiming Liu, I. Laguna, G. Georgakoudis, Jeff Huang
We present OMPRACER, a static tool that uses flow-sensitive, interprocedural analysis to detect data races in OpenMP programs. OMPRACER is fast and scalable, has high code coverage, and supports the most common OpenMP features by combining state-of-the-art pointer analysis, novel value-flow analysis, happens-before tracking, and generalized modelling of OpenMP APIs. Unlike the dynamic tools that currently dominate data race detection, OMPRACER achieves almost 100% code coverage using static analysis to detect a broader category of races without running the program or relying on specific inputs or runtime behaviour. OMPRACER has precision competitive with dynamic tools like Archer and ROMP, passing 105/116 cases in DataRaceBench with a total accuracy of 91%. OMPRACER has been used to analyze several Exascale Computing Project (ECP) proxy applications containing over 2 million lines of code in under 10 minutes, and has revealed previously unknown races in an ECP proxy app and a production simulation for COVID-19.
{"title":"OMPRacer: A Scalable and Precise Static Race Detector for OpenMP Programs","authors":"Bradley Swain, Yanze Li, Peiming Liu, I. Laguna, G. Georgakoudis, Jeff Huang","doi":"10.1109/SC41405.2020.00058","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00058","url":null,"abstract":"We present OMPRACER, a static tool that uses flow-sensitive, interprocedural analysis to detect data races in OpenMP programs. OMPRACER is fast, scalable, has high code coverage, and supports the most common OpenMP features by combining state-of-the-art pointer analysis, novel value-flow analysis, happens-before tracking, and generalized modelling of OpenMP APIs. Unlike dynamic tools that currently dominate data race detection, OMPRACER achieves almost 100% code coverage using static analysis to detect a broader category of races without running the program or relying on specific input or runtime behaviour. OMPRACER has competitive precision with dynamic tools like Archer and ROMP: passing 105/116 cases in DataRaceBench with a total accuracy of 91%. OMPRACER has been used to analyze several Exascale Computing Project proxy applications containing over 2 million lines of code in under 10 minutes. OMPRACER has revealed previously unknown races in an ECP proxy app and a production simulation for COVID19.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121993919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPC I/O Throughput Bottleneck Analysis with Explainable Local Models
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00037
Mihailo Isakov, Eliakin Del Rosario, S. Madireddy, Prasanna Balaprakash, P. Carns, R. Ross, M. Kinsy
With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years’ worth of Darshan logs from the Argonne Leadership Computing Facility’s Theta supercomputer in order to understand the causes of poor I/O throughput. We present Gauge, a data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding the behaviors of clusters of jobs, and interpreting I/O bottlenecks. We find groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and we analyze these groups instead of individual jobs, allowing us to reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study in which a system owner using Gauge was able to identify several clusters that do not conform to conventional I/O behaviors, as well as several potential improvements at both the application and system levels.
{"title":"HPC I/O Throughput Bottleneck Analysis with Explainable Local Models","authors":"Mihailo Isakov, Eliakin Del Rosario, S. Madireddy, Prasanna Balaprakash, P. Carns, R. Ross, M. Kinsy","doi":"10.1109/SC41405.2020.00037","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00037","url":null,"abstract":"With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years’ worth of Darshan logs from the Argonne Leadership Computing Facility’s Theta supercomputer in order to understand causes of poor I/O throughput. We present Gauge: a data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding behaviors of clusters of jobs, and interpreting I/O bottlenecks. We find groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and analyze these groups instead of individual jobs, allowing us to reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study where a system owner using Gauge was able to arrive at several clusters that do not conform to conventional I/O behaviors, as well as find several potential improvements, both on the application level and the system level.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126722574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pencil: A Pipelined Algorithm for Distributed Stencils
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00089
Hengjie Wang, Aparna Chandramowlishwaran
Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well studied for several decades. They are typically highly memory-bound, and as a result, numerous tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. However, in CFD, we are confronted with multi-block structured grids composed of multiple connected iteration spaces distributed across many nodes. In this paper, we propose a pipelined stencil algorithm called Pencil for distributed-memory machines that applies to practical CFD problems spanning multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to $1.92\times$. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the advantage of temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed-memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by $1.33$-$3.41\times$ on a multi-block grid with 32 nodes.
{"title":"Pencil: A Pipelined Algorithm for Distributed Stencils","authors":"Hengjie Wang, Aparna Chandramowlishwaran","doi":"10.1109/SC41405.2020.00089","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00089","url":null,"abstract":"Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well-studied for several decades. Typically they’re highly memory-bound and as a result, numerous tiling algorithms have been proposed to improve its performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. However, in CFD, we are confronted with multi-block structured girds composed of multiple connected iteration spaces distributed across many nodes.In this paper, we propose a pipelined stencil algorithm called Pencil for distributed memory machines that applies to practical CFD problems that span multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to $1.92 times$. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the advantage from temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by $1.33- 3.41 times$ on a multi-block grid with 32 nodes.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125809107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Stencil-Code Computation on a Wafer-Scale Processor
Pub Date: 2020-10-07  DOI: 10.1109/SC41405.2020.00062
K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, Michael Morrison, V. Kibardin, Andrey Portnoy, J. Dietiker, M. Syamlal, Michael James
The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a $600\times 595\times 1536$ mesh, achieving about one third of the machine’s peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
{"title":"Fast Stencil-Code Computation on a Wafer-Scale Processor","authors":"K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, Michael Morrison, V. Kibardin, Andrey Portnoy, J. Dietiker, M. Syamlal, Michael James","doi":"10.1109/SC41405.2020.00062","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00062","url":null,"abstract":"The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a $600times 595times 1536$ mesh, achieving about one third of the machine’s peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131929303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Performance-Portable Nonhydrostatic Atmospheric Dycore for the Energy Exascale Earth System Model Running at Cloud-Resolving Resolutions
Pub Date: 2020-10-01  DOI: 10.1109/SC41405.2020.00096
Luca Bertagna, O. Guba, M. Taylor, J. Foucar, J. Larkin, A. Bradley, S. Rajamanickam, A. Salinger
We present an effort to port the nonhydrostatic atmosphere dynamical core of the Energy Exascale Earth System Model (E3SM) to run efficiently on a variety of architectures, including conventional CPUs, many-core CPUs, and GPUs. We specifically target cloud-resolving resolutions of 3 km and 1 km. To express on-node parallelism we use the C++ library Kokkos, which allows us to achieve performance-portable code in a largely architecture-independent way. Our C++ implementation is at least as fast as the original Fortran implementation on IBM Power9 and Intel Knights Landing processors, showing that the code refactor did not compromise efficiency on CPU architectures. On GPUs, our implementation achieves 0.97 Simulated Years Per Day when running on the full Summit supercomputer. To the best of our knowledge, this is the highest throughput achieved to date by any global atmosphere dynamical core running at such resolutions.
{"title":"A Performance-Portable Nonhydrostatic Atmospheric Dycore for the Energy Exascale Earth System Model Running at Cloud-Resolving Resolutions.","authors":"Luca Bertagna, O. Guba, M. Taylor, J. Foucar, J. Larkin, A. Bradley, S. Rajamanickam, A. Salinger","doi":"10.1109/SC41405.2020.00096","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00096","url":null,"abstract":"We present an effort to port the nonhydrostatic atmosphere dynamical core of the Energy Exascale Earth System Model (E3SM) to efficiently run on a variety of architectures, including conventional CPU, many-core CPU, and GPU. We specifically target cloud-resolving resolutions of 3 km and 1 km. To express on-node parallelism we use the C++ library Kokkos, which allows us to achieve a performance portable code in a largely architecture-independent way. Our C++ implementation is at least as fast as the original Fortran implementation on IBM Power9 and Intel Knights Landing processors, proving that the code refactor did not compromise the efficiency on CPU architectures. On the other hand, when using the GPUs, our implementation is able to achieve 0.97 Simulated Years Per Day, running on the full Summit supercomputer. To the best of our knowledge, this is the most achieved to date by any global atmosphere dynamical core running at such resolutions.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131013584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer
Pub Date: 2020-10-01  DOI: 10.1109/SC41405.2020.00052
K. Pedretti, A. Younge, S. Hammond, J. Laros, M. Curry, Michael J. Aguilar, R. Hoekstra, R. Brightwell
Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of their viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process, several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, we formulate several lessons learned that contributed to the successful deployment of Astra. These insights can help accelerate the deployment and maturation of other first-of-their-kind HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand-node scales, we believe this constitutes strong evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.
{"title":"Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer","authors":"K. Pedretti, A. Younge, S. Hammond, J. Laros, M. Curry, Michael J. Aguilar, R. Hoekstra, R. Brightwell","doi":"10.1109/SC41405.2020.00052","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00052","url":null,"abstract":"Arm processors have been explored in HPC for several years, however there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first Petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, several lessons learned are formulated that contributed to the successful deployment of Astra. These insights can be helpful to accelerate deploying and maturing other first-seen HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"317 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123637078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}