AI4IO: A suite of AI-based tools for IO-aware scheduling
Pub Date: 2022-04-03 | DOI: 10.1177/10943420221079765
Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether it stems from bursts of IO demand or from parallel file system (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools that uses AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it; CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves workload makespan by up to 6.4%, which in our large-scale workload amounts to more than 18,000 node-h of saved resources per week on a production cluster.
{"title":"AI4IO: A suite of AI-based tools for IO-aware scheduling","authors":"Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer","doi":"10.1177/10943420221079765","DOIUrl":"https://doi.org/10.1177/10943420221079765","url":null,"abstract":"Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"370 - 387"},"PeriodicalIF":3.1,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45876634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An analytical performance model of generalized hierarchical scheduling
Pub Date: 2022-03-26 | DOI: 10.1177/10943420211051039
Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer
High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges facing modern workflows, but it has not been widely adopted in HPC. A major difficulty hindering adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS and validate it with four different applications on a moderately sized system. Our validation shows that the model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R² statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems: we measured throughput improvements of up to 270× on our moderately sized system. We then apply our performance model to a pre-exascale system, where it predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next-generation systems.
{"title":"An analytical performance model of generalized hierarchical scheduling","authors":"Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer","doi":"10.1177/10943420211051039","DOIUrl":"https://doi.org/10.1177/10943420211051039","url":null,"abstract":"High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R2 statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"289 - 306"},"PeriodicalIF":3.1,"publicationDate":"2022-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42629053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development of NCL equivalent serial and parallel python routines for meteorological data analysis
Pub Date: 2022-03-22 | DOI: 10.1177/10943420221077110
Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne
The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, due to its sequential execution model, NCL cannot properly utilize the parallel processing power provided by High-Performance Computing (HPC) systems. To date, very few techniques have been developed to exploit the multi-core capabilities of modern HPC systems in these routines. Open-source languages are becoming highly popular because they support the major functionality required for data analysis and parallel computing. Hence, the developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization, enabling the geosciences community to play an active role in its development and support. This study focuses on re-implementing some of the most widely used NCL routines in Python. To handle the analysis of large datasets, parallel versions of these routines are developed to work within a single node, using multi-core CPUs to achieve parallelism. Results show close agreement between NCL and Python outputs, and the parallel versions provide good scaling compared to their sequential counterparts.
{"title":"Development of NCL equivalent serial and parallel python routines for meteorological data analysis","authors":"Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne","doi":"10.1177/10943420221077110","DOIUrl":"https://doi.org/10.1177/10943420221077110","url":null,"abstract":"The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, due to its sequential nature of execution, it cannot properly utilize the parallel processing power provided by High-Performance Computing systems (HPCs). Until now very few techniques have been developed to make use of the multi-core functionality of modern HPC systems on these functions. In the recent trend, open-source languages are becoming highly popular because they support major functionalities required for data analysis and parallel computing. Hence, developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization and to enable the geosciences community to play an active role in its development and support. This study focuses on developing some of the widely used NCL routines in Python. To deal with the analysis of large datasets, parallel versions of these routines are developed to work within a single node and make use of multi-core CPUs to achieve parallelism. Results show high accuracy between NCL and Python outputs and the parallel versions provided good scaling compared to their sequential counterparts.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"337 - 355"},"PeriodicalIF":3.1,"publicationDate":"2022-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43857226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia
Pub Date: 2022-03-22 | DOI: 10.1177/10943420211067520
Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter
Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In the search for higher-quality imaging, researchers ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level languages such as CUDA C/C++, or rephrasing them in terms of existing library implementations of established algorithms. The former has a detrimental impact on research productivity and requires system-level programming expertise; the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain high performance (portability) on GPUs together with high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using the rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. We thereby mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations, and we analyse and compare the required programming effort. The Julia implementations achieve performance in line with existing GPU implementations written in the low-level, unproductive CUDA C, while requiring much less programming effort, even less than is needed for far less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of medical imaging research and deliver excellent performance at a low cost in terms of programming effort.
{"title":"Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia","authors":"Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter","doi":"10.1177/10943420211067520","DOIUrl":"https://doi.org/10.1177/10943420211067520","url":null,"abstract":"Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In search for higher quality imaging, researchers can ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level programming languages such as CUDA C/C++ or rephrasing them in terms of existing implementations of established algorithms in libraries. The former has a detrimental impact on research productivity and requires system-level programming expertise, and the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain both high performance (portability) on GPUs and high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. Thus, we mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations. We also analyse and compare the required programming effort. With the Julia implementations, performance in line with existing GPU implementations written in the low-level, unproductive programming language CUDA C is achieved, while requiring much less programming effort, even less than what is needed for much less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of research into medical imaging and deliver excellent performance at a low cost in terms of programming effort.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"320 - 336"},"PeriodicalIF":3.1,"publicationDate":"2022-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"65398900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient high-precision integer multiplication on the GPU
Pub Date: 2022-03-20 | DOI: 10.1177/10943420221077964
A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka
The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication. One is based on tiling the classical convolution algorithm while taking advantage of new CUDA architectures; it is a novel approach that computes the multiplication using integers, with no loss of accuracy. The other is based on the Strassen algorithm, which multiplies large polynomials using the FFT, here adapting the fastest FFT libraries for current GPUs and working over the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for "large enough" integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization; this work describes a parallel implementation of that operation as well. Our results show the efficiency of our approaches for short, medium, and large sizes.
{"title":"Efficient high-precision integer multiplication on the GPU","authors":"A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka","doi":"10.1177/10943420221077964","DOIUrl":"https://doi.org/10.1177/10943420221077964","url":null,"abstract":"The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication: one approach is based on tiling the classical convolution algorithm, but taking advantage of new CUDA architectures, a novelty approach to compute the multiplication using integers without accuracy lossless; the other one is based on the Strassen algorithm, an algorithm that multiplies large polynomials using the FFT operation, but adapting the fastest FFT libraries for current GPUs and working on the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, but this work describes a parallel implementation for this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"356 - 369"},"PeriodicalIF":3.1,"publicationDate":"2022-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48674353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Pub Date: 2022-03-07 | DOI: 10.1177/10943420221090256
Hiroyuki Ootomo, Rio Yokota
Tensor Cores are mixed-precision matrix-matrix multiplication units on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning, but many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can exploit them as well. To compute a matrix multiplication on Tensor Cores, the input matrices must be converted to half precision, which results in a loss of accuracy. To avoid this, the mantissa bits lost in the conversion can be kept in additional half-precision variables and used to correct the accuracy of the matrix-matrix multiplication. Even with this correction, using Tensor Cores yields higher throughput than FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, low-power matrix-matrix multiplication implementation using Tensor Cores that exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We found that the key to achieving this accuracy is how the rounding inside the Tensor Cores and the underflow probability during the correction computation are handled. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full FP32 exponent range using TF32 Tensor Cores on NVIDIA A100 GPUs, outperforming the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.
{"title":"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1177/10943420221090256","DOIUrl":"https://doi.org/10.1177/10943420221090256","url":null,"abstract":"Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"475 - 491"},"PeriodicalIF":3.1,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42971490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parthenon—a performance portable block-structured adaptive mesh refinement framework
Pub Date: 2022-02-24 | DOI: 10.1177/10943420221143775
P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone
On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance-portable programming models are available, support at the application level lags behind. To address this issue, we present the performance-portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides abstractions at several levels, from multidimensional variables, to packages that define and separate components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, and Fujitsu A64FX CPUs. At the largest scale, on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10^13 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at approximately 92% weak-scaling parallel efficiency (starting from a single node). Combined with being an open, collaborative project, this makes Parthenon an ideal framework for targeting exascale simulations, in which downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
{"title":"Parthenon—a performance portable block-structured adaptive mesh refinement framework","authors":"P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone","doi":"10.1177/10943420221143775","DOIUrl":"https://doi.org/10.1177/10943420221143775","url":null,"abstract":"On the path to exascale the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lacks behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model, and provides various levels of abstractions from multidimensional variables, to packages defining and separating components, to launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 1013 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at ≈ 92 % weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively-parallel, device-accelerated AMR.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"465 - 486"},"PeriodicalIF":3.1,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43605690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku
Pub Date: 2022-01-02 | DOI: 10.1177/10943420211065723
Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana
In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed an ab initio simulation that solves the Maxwell equations for the light electromagnetic fields, the time-dependent Kohn-Sham equation for the electrons, and the Newton equation for the ions in extended systems. In the simulation, the most time-consuming parts were the stencil and nonlocal pseudopotential operations on the electron orbitals, as well as the fast Fourier transforms for the electron density. The code was thoroughly optimized for the Fujitsu A64FX processor to achieve the highest performance. A simulation of an amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution, with performance close to the maximum possible value given the memory-bandwidth bound, as well as excellent weak scalability.
{"title":"Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku","authors":"Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana","doi":"10.1177/10943420211065723","DOIUrl":"https://doi.org/10.1177/10943420211065723","url":null,"abstract":"In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed the ab initio simulation solving the Maxwell equation for light electromagnetic fields, the time-dependent Kohn-Sham equation for electrons, and the Newton equation for ions in extended systems. In the simulation, the most time-consuming parts were stencil and nonlocal pseudopotential operations on the electron orbitals as well as fast Fourier transforms for the electron density. Code optimization was thoroughly performed on the Fujitsu A64FX processor to achieve the highest performance. A simulation of amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution with the performance close to the maximum possible value in view of the memory bandwidth bound, as well as excellent weak scalability.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"182 - 197"},"PeriodicalIF":3.1,"publicationDate":"2022-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46158909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Performance Computing: 8th Latin American Conference, CARLA 2021, Guadalajara, Mexico, October 6–8, 2021, Revised Selected Papers
Pub Date: 2022-01-01 | DOI: 10.1007/978-3-031-04209-6
ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure
Pub Date: 2022-01-01 | DOI: 10.1177/10943420211042558
J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump
Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.
{"title":"ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure","authors":"J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump","doi":"10.1177/10943420211042558","DOIUrl":"https://doi.org/10.1177/10943420211042558","url":null,"abstract":"Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"13 - 39"},"PeriodicalIF":3.1,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44573865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}