AI4IO: A suite of AI-based tools for IO-aware scheduling
Pub Date: 2022-04-03 | DOI: 10.1177/10943420221079765
Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether it stems from bursts of IO demand or from parallel file system (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools that uses AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it; CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves workload makespan by up to 6.4%, which in our large-scale workload amounts to more than 18,000 node-h of saved resources per week on a production cluster.
{"title":"AI4IO: A suite of AI-based tools for IO-aware scheduling","authors":"Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer","doi":"10.1177/10943420221079765","DOIUrl":"https://doi.org/10.1177/10943420221079765","url":null,"abstract":"Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"370 - 387"},"PeriodicalIF":3.1,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45876634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An analytical performance model of generalized hierarchical scheduling
Pub Date: 2022-03-26 | DOI: 10.1177/10943420211051039
Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer
High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges facing modern workflows, but it has not been widely adopted in HPC. A major difficulty hindering adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS and validate it with four different applications on a moderately sized system. Our validation shows that the model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R² statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems: we measured throughput improvements of up to 270× on our moderately sized system. We then apply our performance model to a pre-exascale system, where it predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next-generation systems.
{"title":"An analytical performance model of generalized hierarchical scheduling","authors":"Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer","doi":"10.1177/10943420211051039","DOIUrl":"https://doi.org/10.1177/10943420211051039","url":null,"abstract":"High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R2 statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"289 - 306"},"PeriodicalIF":3.1,"publicationDate":"2022-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42629053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development of NCL equivalent serial and parallel python routines for meteorological data analysis
Pub Date: 2022-03-22 | DOI: 10.1177/10943420221077110
Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne
The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, due to its sequential execution model, NCL cannot properly utilize the parallel processing power provided by High-Performance Computing (HPC) systems. To date, very few techniques have been developed to exploit the multi-core capabilities of modern HPC systems in these routines. Open-source languages are becoming highly popular because they support the major functionality required for data analysis and parallel computing. Hence, the developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization, enabling the geosciences community to play an active role in its development and support. This study focuses on re-implementing some of the most widely used NCL routines in Python. To handle the analysis of large datasets, parallel versions of these routines are developed to work within a single node, using multi-core CPUs to achieve parallelism. Results show close agreement between NCL and Python outputs, and the parallel versions provide good scaling compared to their sequential counterparts.
{"title":"Development of NCL equivalent serial and parallel python routines for meteorological data analysis","authors":"Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne","doi":"10.1177/10943420221077110","DOIUrl":"https://doi.org/10.1177/10943420221077110","url":null,"abstract":"The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, due to its sequential nature of execution, it cannot properly utilize the parallel processing power provided by High-Performance Computing systems (HPCs). Until now very few techniques have been developed to make use of the multi-core functionality of modern HPC systems on these functions. In the recent trend, open-source languages are becoming highly popular because they support major functionalities required for data analysis and parallel computing. Hence, developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization and to enable the geosciences community to play an active role in its development and support. This study focuses on developing some of the widely used NCL routines in Python. To deal with the analysis of large datasets, parallel versions of these routines are developed to work within a single node and make use of multi-core CPUs to achieve parallelism. Results show high accuracy between NCL and Python outputs and the parallel versions provided good scaling compared to their sequential counterparts.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"337 - 355"},"PeriodicalIF":3.1,"publicationDate":"2022-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43857226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia
Pub Date: 2022-03-22 | DOI: 10.1177/10943420211067520
Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter
Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In the search for higher-quality imaging, researchers ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level languages such as CUDA C/C++, or rephrasing them in terms of existing library implementations of established algorithms. The former has a detrimental impact on research productivity and requires system-level programming expertise; the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain high performance (portability) on GPUs together with high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using the rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. We thereby mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations, and we analyse and compare the required programming effort. The Julia implementations achieve performance in line with existing GPU implementations written in the low-level, unproductive CUDA C, while requiring much less programming effort, even less than is needed for far less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of medical imaging research and deliver excellent performance at a low cost in terms of programming effort.
{"title":"Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia","authors":"Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter","doi":"10.1177/10943420211067520","DOIUrl":"https://doi.org/10.1177/10943420211067520","url":null,"abstract":"Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In search for higher quality imaging, researchers can ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level programming languages such as CUDA C/C++ or rephrasing them in terms of existing implementations of established algorithms in libraries. The former has a detrimental impact on research productivity and requires system-level programming expertise, and the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain both high performance (portability) on GPUs and high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. Thus, we mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations. We also analyse and compare the required programming effort. With the Julia implementations, performance in line with existing GPU implementations written in the low-level, unproductive programming language CUDA C is achieved, while requiring much less programming effort, even less than what is needed for much less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of research into medical imaging and deliver excellent performance at a low cost in terms of programming effort.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"320 - 336"},"PeriodicalIF":3.1,"publicationDate":"2022-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"65398900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient high-precision integer multiplication on the GPU
Pub Date: 2022-03-20 | DOI: 10.1177/10943420221077964
A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka
The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication. One is based on tiling the classical convolution algorithm while taking advantage of new CUDA architectures; it is a novel approach that computes the multiplication using integers, with no loss of accuracy. The other is based on the Strassen algorithm, which multiplies large polynomials using the FFT, here adapting the fastest FFT libraries for current GPUs and working over the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for "large enough" integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization; this work describes a parallel implementation of that operation as well. Our results show the efficiency of our approaches for short, medium, and large sizes.
{"title":"Efficient high-precision integer multiplication on the GPU","authors":"A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka","doi":"10.1177/10943420221077964","DOIUrl":"https://doi.org/10.1177/10943420221077964","url":null,"abstract":"The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication: one approach is based on tiling the classical convolution algorithm, but taking advantage of new CUDA architectures, a novelty approach to compute the multiplication using integers without accuracy lossless; the other one is based on the Strassen algorithm, an algorithm that multiplies large polynomials using the FFT operation, but adapting the fastest FFT libraries for current GPUs and working on the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, but this work describes a parallel implementation for this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"356 - 369"},"PeriodicalIF":3.1,"publicationDate":"2022-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48674353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Pub Date: 2022-03-07 | DOI: 10.1177/10943420221090256
Hiroyuki Ootomo, Rio Yokota
Tensor Cores are mixed-precision matrix-matrix multiplication units on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning, but many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can exploit them as well. To compute a matrix multiplication on Tensor Cores, the input matrices must be converted to half precision, which results in a loss of accuracy. To avoid this, the mantissa bits lost in the conversion can be kept in additional half-precision variables and used to correct the accuracy of the matrix-matrix multiplication. Even with this correction, using Tensor Cores yields higher throughput than FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, low-power matrix-matrix multiplication implementation using Tensor Cores that exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We found that the key to achieving this accuracy is how the rounding inside the Tensor Cores and the underflow probability during the correction computation are handled. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full FP32 exponent range using TF32 Tensor Cores on NVIDIA A100 GPUs, outperforming the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.
{"title":"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1177/10943420221090256","DOIUrl":"https://doi.org/10.1177/10943420221090256","url":null,"abstract":"Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"475 - 491"},"PeriodicalIF":3.1,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42971490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parthenon—a performance portable block-structured adaptive mesh refinement framework
Pub Date: 2022-02-24 | DOI: 10.1177/10943420221143775
P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone
On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance-portable programming models are available, support at the application level lags behind. To address this issue, we present the performance-portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides abstractions at several levels, from multidimensional variables, to packages that define and separate components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, and Fujitsu A64FX CPUs. At the largest scale, on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10^13 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at approximately 92% weak-scaling parallel efficiency (starting from a single node). Combined with being an open, collaborative project, this makes Parthenon an ideal framework for targeting exascale simulations, in which downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
{"title":"Parthenon—a performance portable block-structured adaptive mesh refinement framework","authors":"P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone","doi":"10.1177/10943420221143775","DOIUrl":"https://doi.org/10.1177/10943420221143775","url":null,"abstract":"On the path to exascale the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lacks behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model, and provides various levels of abstractions from multidimensional variables, to packages defining and separating components, to launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 1013 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at ≈ 92 % weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively-parallel, device-accelerated AMR.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"465 - 486"},"PeriodicalIF":3.1,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43605690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku
Pub Date: 2022-01-02 | DOI: 10.1177/10943420211065723
Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana
In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed an ab initio simulation that solves the Maxwell equations for the light electromagnetic fields, the time-dependent Kohn-Sham equation for the electrons, and the Newton equation for the ions in extended systems. In the simulation, the most time-consuming parts were the stencil and nonlocal pseudopotential operations on the electron orbitals, as well as the fast Fourier transforms for the electron density. The code was thoroughly optimized for the Fujitsu A64FX processor to achieve the highest performance. A simulation of an amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution, with performance close to the maximum possible value given the memory-bandwidth bound, as well as excellent weak scalability.
{"title":"Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku","authors":"Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana","doi":"10.1177/10943420211065723","DOIUrl":"https://doi.org/10.1177/10943420211065723","url":null,"abstract":"In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed the ab initio simulation solving the Maxwell equation for light electromagnetic fields, the time-dependent Kohn-Sham equation for electrons, and the Newton equation for ions in extended systems. In the simulation, the most time-consuming parts were stencil and nonlocal pseudopotential operations on the electron orbitals as well as fast Fourier transforms for the electron density. Code optimization was thoroughly performed on the Fujitsu A64FX processor to achieve the highest performance. A simulation of amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution with the performance close to the maximum possible value in view of the memory bandwidth bound, as well as excellent weak scalability.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"182 - 197"},"PeriodicalIF":3.1,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46158909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Performance Computing: 8th Latin American Conference, CARLA 2021, Guadalajara, Mexico, October 6–8, 2021, Revised Selected Papers
Pub Date: 2022-01-01 | DOI: 10.1007/978-3-031-04209-6
ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure
Pub Date: 2022-01-01 | DOI: 10.1177/10943420211042558
J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump
Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.
{"title":"ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure","authors":"J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump","doi":"10.1177/10943420211042558","DOIUrl":"https://doi.org/10.1177/10943420211042558","url":null,"abstract":"Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"13 - 39"},"PeriodicalIF":3.1,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44573865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}