
Latest Publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

Waiting Game: Optimally Provisioning Fixed Resources for Cloud-Enabled Schedulers
Pradeep Ambati, Noman Bashir, D. Irwin, P. Shenoy
While cloud platforms enable users to rent computing resources on demand to execute their jobs, buying fixed resources is still much cheaper than renting if their utilization is high. Thus, optimizing cloud costs requires users to determine how many fixed resources to buy versus rent based on their workload. In this paper, we introduce the concept of a waiting policy for cloud-enabled schedulers, which is the dual of a scheduling policy, and show that the optimal cost depends on it. We define multiple waiting policies and develop simple analytical models to reveal their tradeoff between fixed resource provisioning, cost, and job waiting time. We evaluate the impact of these waiting policies on a year-long production batch workload consisting of 14M jobs run on a 14.3k-core cluster, and show that a compound waiting policy decreases the cost (by 5%) and mean job waiting time (by 7×) compared to a fixed cluster of the current size.
Citations: 10
INEC: Fast and Coherent In-Network Erasure Coding
Haiyang Shi, Xiaoyi Lu
Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.
Citations: 11
Compiling Generalized Histograms for GPU
Troels Henriksen, Sune Hellfritzsch, P. Sadayappan, C. Oancea
We present and evaluate an implementation technique for histogram-like computations on GPUs that ensures work-efficient asymptotic cost, support for arbitrary associative and commutative operators, and efficient use of hardware-supported atomic operations when applicable. Based on a systematic empirical examination of the design space, we develop a technique that balances conflict rates and memory footprint. We demonstrate our technique both as a library implementation in CUDA and by extending the parallel array language Futhark with a new construct for expressing generalized histograms, supported by several compiler optimizations. We show that our histogram implementation taken in isolation outperforms similar primitives from CUB, and that it is competitive with or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.
Citations: 12
Toward Realization of Numerical Towing-Tank Tests by Wall-Resolved Large Eddy Simulation based on 32 Billion Grid Finite-Element Computation
C. Kato, Y. Yamade, K. Nagano, Kiyoshi Kumahata, K. Minami, Tatsuo Nishikawa
To realize numerical towing-tank tests by substantially shortening the time to solution, a general-purpose Finite-Element flow solver, named FrontFlow/blue (FFB), has been fully optimized so as to achieve the maximum possible sustained memory throughput for three of its four hot kernels. A single-node sustained performance of 179.0 GFLOPS, which corresponds to 5.3% of the peak performance, has been achieved on Fugaku, the next flagship computer of Japan. A weak-scaling benchmark test has confirmed that FFB runs with a parallel efficiency of over 85% up to 5,505,024 compute cores, and an overall sustained performance of 16.7 PFLOPS has been achieved. As a result, the time needed for large-eddy simulation using 32 billion grids has been significantly reduced from almost two days to only 37 min., or by a factor of 71. This clearly indicates that a numerical towing-tank could actually be built for ship hydrodynamics within a few years.
Citations: 6
OMPRacer: A Scalable and Precise Static Race Detector for OpenMP Programs
Bradley Swain, Yanze Li, Peiming Liu, I. Laguna, G. Georgakoudis, Jeff Huang
We present OMPRACER, a static tool that uses flow-sensitive, interprocedural analysis to detect data races in OpenMP programs. OMPRACER is fast, scalable, has high code coverage, and supports the most common OpenMP features by combining state-of-the-art pointer analysis, novel value-flow analysis, happens-before tracking, and generalized modelling of OpenMP APIs. Unlike dynamic tools that currently dominate data race detection, OMPRACER achieves almost 100% code coverage using static analysis to detect a broader category of races without running the program or relying on specific input or runtime behaviour. OMPRACER's precision is competitive with dynamic tools such as Archer and ROMP: it passes 105/116 cases in DataRaceBench with a total accuracy of 91%. OMPRACER has been used to analyze several Exascale Computing Project proxy applications containing over 2 million lines of code in under 10 minutes. OMPRACER has revealed previously unknown races in an ECP proxy app and a production simulation for COVID-19.
Citations: 12
HPC I/O Throughput Bottleneck Analysis with Explainable Local Models
Mihailo Isakov, Eliakin Del Rosario, S. Madireddy, Prasanna Balaprakash, P. Carns, R. Ross, M. Kinsy
With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years’ worth of Darshan logs from the Argonne Leadership Computing Facility’s Theta supercomputer in order to understand causes of poor I/O throughput. We present Gauge: a data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding behaviors of clusters of jobs, and interpreting I/O bottlenecks. We find groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and analyze these groups instead of individual jobs, allowing us to reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study where a system owner using Gauge was able to arrive at several clusters that do not conform to conventional I/O behaviors, as well as find several potential improvements, both on the application level and the system level.
Citations: 19
Pencil: A Pipelined Algorithm for Distributed Stencils
Hengjie Wang, Aparna Chandramowlishwaran
Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well-studied for several decades. Typically they are highly memory-bound, and as a result numerous tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. However, in CFD, we are confronted with multi-block structured grids composed of multiple connected iteration spaces distributed across many nodes. In this paper, we propose a pipelined stencil algorithm called Pencil for distributed memory machines that applies to practical CFD problems spanning multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to 1.92×. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the advantage of temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed-memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by 1.33-3.41× on a multi-block grid with 32 nodes.
Citations: 9
Fast Stencil-Code Computation on a Wafer-Scale Processor
K. Rocki, D. V. Essendelft, I. Sharapov, R. Schreiber, Michael Morrison, V. Kibardin, Andrey Portnoy, J. Dietiker, M. Syamlal, Michael James
The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 × 595 × 1536 mesh, achieving about one third of the machine's peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
Citations: 45
A Performance-Portable Nonhydrostatic Atmospheric Dycore for the Energy Exascale Earth System Model Running at Cloud-Resolving Resolutions.
Luca Bertagna, O. Guba, M. Taylor, J. Foucar, J. Larkin, A. Bradley, S. Rajamanickam, A. Salinger
We present an effort to port the nonhydrostatic atmosphere dynamical core of the Energy Exascale Earth System Model (E3SM) to run efficiently on a variety of architectures, including conventional CPU, many-core CPU, and GPU. We specifically target cloud-resolving resolutions of 3 km and 1 km. To express on-node parallelism we use the C++ library Kokkos, which allows us to achieve performance-portable code in a largely architecture-independent way. Our C++ implementation is at least as fast as the original Fortran implementation on IBM Power9 and Intel Knights Landing processors, proving that the code refactor did not compromise efficiency on CPU architectures. On the other hand, when using GPUs, our implementation achieves 0.97 Simulated Years Per Day running on the full Summit supercomputer. To the best of our knowledge, this is the highest simulation throughput achieved to date by any global atmosphere dynamical core running at such resolutions.
Citations: 10
Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer
K. Pedretti, A. Younge, S. Hammond, J. Laros, M. Curry, Michael J. Aguilar, R. Hoekstra, R. Brightwell
Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of their viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first Petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, several lessons learned are formulated that contributed to the successful deployment of Astra. These insights can help accelerate the deployment and maturation of other first-of-their-kind HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand-node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.
Citations: 8