Pub Date: 2022-07-01  DOI: 10.1177/10943420221103014
M. Norman
{"title":"Corrigendum to ‘Unprecedented cloud resolution in a GPU-enabled full-physics atmospheric climate simulation on OLCF’s summit supercomputer’","authors":"M. Norman","doi":"10.1177/10943420221103014","DOIUrl":"https://doi.org/10.1177/10943420221103014","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"564 - 564"},"PeriodicalIF":3.1,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47297709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-23  DOI: 10.1177/10943420231158616
Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis
We present our approach to direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.
{"title":"Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran","authors":"Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis","doi":"10.1177/10943420231158616","DOIUrl":"https://doi.org/10.1177/10943420231158616","url":null,"abstract":"We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"487 - 502"},"PeriodicalIF":3.1,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42194182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-03  DOI: 10.1177/10943420221102873
R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen
Recent advancements in hardware accelerators such as Tensor Processing Units (TPUs) speed up computation relative to Central Processing Units (CPUs) not only for machine learning but, as demonstrated here, also for scientific modeling and computer simulations. To study TPU hardware for distributed scientific computing, we solve partial differential equations (PDEs) for the physics simulation of fluids to model riverine floods. We demonstrate that TPUs achieve a speedup of two orders of magnitude over CPUs. Running physics simulations on TPUs is publicly accessible via the Google Cloud Platform, and we release a Python interactive notebook version of the simulation.
{"title":"Accelerating physics simulations with tensor processing units: An inundation modeling example","authors":"R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen","doi":"10.1177/10943420221102873","DOIUrl":"https://doi.org/10.1177/10943420221102873","url":null,"abstract":"Recent advancements in hardware accelerators such as Tensor Processing Units (TPUs) speed up computation time relative to Central Processing Units (CPUs) not only for machine learning but, as demonstrated here, also for scientific modeling and computer simulations. To study TPU hardware for distributed scientific computing, we solve partial differential equations (PDEs) for the physics simulation of fluids to model riverine floods. We demonstrate that TPUs achieve a two orders of magnitude speedup over CPUs. Running physics simulations on TPUs is publicly accessible via the Google Cloud Platform, and we release a Python interactive notebook version of the simulation.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"510 - 523"},"PeriodicalIF":3.1,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41458120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-05-24  DOI: 10.1177/10943420231177631
Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl
The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.
{"title":"Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics","authors":"Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl","doi":"10.1177/10943420231177631","DOIUrl":"https://doi.org/10.1177/10943420231177631","url":null,"abstract":"The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"530 - 538"},"PeriodicalIF":3.1,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46337516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-05-18  DOI: 10.1177/10943420221107880
M. Kronbichler, D. Sashko, Peter Munch
This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow main memory (RAM) exactly once per iteration, with all additional accesses hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit.
{"title":"Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations","authors":"M. Kronbichler, D. Sashko, Peter Munch","doi":"10.1177/10943420221107880","DOIUrl":"https://doi.org/10.1177/10943420221107880","url":null,"abstract":"This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional access hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"61 - 81"},"PeriodicalIF":3.1,"publicationDate":"2022-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44370332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-05-12  DOI: 10.1177/10943420221085947
M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes
Recently, global and local relaxation Runge–Kutta methods have been developed for guaranteeing the conservation, dissipation, or other solution properties for general convex functionals whose dynamics are crucial for an ordinary differential equation solution. These novel time integration procedures have an application in a wide range of problems that require dynamics-consistent and stable numerical methods. The application of a relaxation scheme involves solving scalar nonlinear algebraic equations to find the relaxation parameter. Even though root-finding may seem to be a technically straightforward and computationally insignificant problem, we address it at scale as we solve full-scale industrial problems on a CPU-powered supercomputer and show its cost to be considerable. In particular, we apply the relaxation schemes in the context of the compressible Navier–Stokes equations and use them to enforce the correct entropy evolution. We use seven different algorithms to solve for the global and local relaxation parameters and analyze their strong scalability. As a result of this analysis, for the global relaxation scheme we recommend Brent’s method for problems with a low polynomial degree and small sizes, while the secant method proves to be the best choice for higher polynomial degree solutions and large problem sizes. For the local relaxation scheme, we recommend the secant method. Further, we compare the schemes’ performance using their most efficient implementations, where we look at their effect on the timestep size, overhead, and weak scalability. We show the global relaxation scheme to be always more expensive than the local approach—typically 1.1–1.5 times the cost. At the same time, we highlight scenarios where the global relaxation scheme might underperform due to its increased communication requirements. Finally, we present an analysis that sets expectations on the computational overhead anticipated based on the system properties.
{"title":"Performance analysis of relaxation Runge–Kutta methods","authors":"M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes","doi":"10.1177/10943420221085947","DOIUrl":"https://doi.org/10.1177/10943420221085947","url":null,"abstract":"Recently, global and local relaxation Runge–Kutta methods have been developed for guaranteeing the conservation, dissipation, or other solution properties for general convex functionals whose dynamics are crucial for an ordinary differential equation solution. These novel time integration procedures have an application in a wide range of problems that require dynamics-consistent and stable numerical methods. The application of a relaxation scheme involves solving scalar nonlinear algebraic equations to find the relaxation parameter. Even though root-finding may seem to be a problem technically straightforward and computationally insignificant, we address the problem at scale as we solve full-scale industrial problems on a CPU-powered supercomputer and show its cost to be considerable. In particular, we apply the relaxation schemes in the context of the compressible Navier–Stokes equations and use them to enforce the correct entropy evolution. We use seven different algorithms to solve for the global and local relaxation parameters and analyze their strong scalability. As a result of this analysis, within the global relaxation scheme, we recommend using Brent’s method for problems with a low polynomial degree and of small sizes for the global relaxation scheme, while secant proves to be the best choice for higher polynomial degree solutions and large problem sizes. For the local relaxation scheme, we recommend secant. Further, we compare the schemes’ performance using their most efficient implementations, where we look at their effect on the timestep size, overhead, and weak scalability. We show the global relaxation scheme to be always more expensive than the local approach—typically 1.1–1.5 times the cost. At the same time, we highlight scenarios where the global relaxation scheme might underperform due to its increased communication requirements. Finally, we present an analysis that sets expectations on the computational overhead anticipated based on the system properties.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"524 - 542"},"PeriodicalIF":3.1,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44137336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-05-06  DOI: 10.1177/10943420221084657
D. Ruda, S. Turek, D. Ribbrock, P. Zajác
Recently, accelerator hardware in the form of graphics cards including Tensor Cores, specialized for AI, has significantly gained importance in the domain of high-performance computing. For example, NVIDIA’s Tesla V100 promises a computing power of up to 125 TFLOP/s achieved by Tensor Cores, but only if half precision floating point format is used. We describe, with numerical examples, the difficulties and the discrepancy between theoretical and actual computing power if one seeks to use such hardware for numerical simulations, that is, for solving partial differential equations with a matrix-based finite element method. If certain requirements, namely low condition numbers and many dense matrix operations, are met, the indicated high performance can be reached without an excessive loss of accuracy. A new method to solve linear systems arising from Poisson’s equation in 2D that meets these requirements, based on “prehandling” by means of hierarchical finite elements and an additional Schur complement approach, is presented and analyzed. We provide numerical results illustrating the computational performance of this method and compare it to a commonly used (geometric) multigrid solver on standard hardware. It turns out that we can exploit nearly the full computational power of Tensor Cores and achieve a significant speed-up compared to the standard methodology without losing accuracy.
{"title":"Very fast finite element Poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100","authors":"D. Ruda, S. Turek, D. Ribbrock, P. Zajác","doi":"10.1177/10943420221084657","DOIUrl":"https://doi.org/10.1177/10943420221084657","url":null,"abstract":"Recently, accelerator hardware in the form of graphics cards including Tensor Cores, specialized for AI, has significantly gained importance in the domain of high-performance computing. For example, NVIDIA’s Tesla V100 promises a computing power of up to 125 TFLOP/s achieved by Tensor Cores, but only if half precision floating point format is used. We describe the difficulties and discrepancy between theoretical and actual computing power if one seeks to use such hardware for numerical simulations, that is, solving partial differential equations with a matrix-based finite element method, with numerical examples. If certain requirements, namely low condition numbers and many dense matrix operations, are met, the indicated high performance can be reached without an excessive loss of accuracy. A new method to solve linear systems arising from Poisson’s equation in 2D that meets these requirements, based on “prehandling” by means of hier-archical finite elements and an additional Schur complement approach, is presented and analyzed. We provide numerical results illustrating the computational performance of this method and compare it to a commonly used (geometric) multigrid solver on standard hardware. It turns out that we can exploit nearly the full computational power of Tensor Cores and achieve a significant speed-up compared to the standard methodology without losing accuracy.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"459 - 474"},"PeriodicalIF":3.1,"publicationDate":"2022-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43024478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-05-01  DOI: 10.1177/10943420221077107
Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé
This work covers the employment of PHAST, a hardware-agnostic programming library, in a real-world application: the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another to run on GPUs. With PHAST, we aim to develop a single-source implementation capable of running efficiently on both CPU and GPU. In this paper, we start by carrying out a performance analysis of a basic Caffe implementation using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by a few ‘heavy’ layers. In refining the inefficient parts of this version, we pursue two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translate into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an equivalent amount of time to native code on CPU and GPU. Furthermore, PHAST achieves speedups of 51% and 49% with the CIFAR-10 database against native code on CPU and GPU, respectively. These results provide a new horizon for software development in the upcoming heterogeneous computing era.
{"title":"Performance portability in a real world application: PHAST applied to Caffe","authors":"Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé","doi":"10.1177/10943420221077107","DOIUrl":"https://doi.org/10.1177/10943420221077107","url":null,"abstract":"This work covers the PHAST Library’s employment, a hardware-agnostic programming library, to a real-world application like the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another one to run on the GPU side. With PHAST, we aim to develop a single-source code implementation capable of running efficiently on CPU and GPU. In this paper, we start by carrying out a basic Caffe implementation performance analysis using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by few ‘heavy’ layers. In refining the inefficient parts of this version, we find two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translates into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an equivalent amount of time as native code in CPU and GPU. Furthermore, PHAST achieves a speedup of 51% and a 49% with the CIFAR-10 database against native code in CPU and GPU, respectively. These results provide a new horizon for software development in the upcoming heterogeneous computing era.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"419 - 439"},"PeriodicalIF":3.1,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49259162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-04-08  DOI: 10.1177/10943420231183688
Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price
High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis not only highlights the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI, with speedups between 1.26x and 1.82x across CPU and GPU architectures, but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.
{"title":"Performance portable ice-sheet modeling with MALI","authors":"Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price","doi":"10.1177/10943420231183688","DOIUrl":"https://doi.org/10.1177/10943420231183688","url":null,"abstract":"High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"600 - 625"},"PeriodicalIF":3.1,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41395091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-04-03  DOI: 10.1177/10943420221079765
Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether it comes from bursts of IO demand or from parallel file system (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan by up to 6.4%, which can account for more than 18,000 node-hours of saved resources per week on a production cluster in our large-scale workload.
{"title":"AI4IO: A suite of AI-based tools for IO-aware scheduling","authors":"Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer","doi":"10.1177/10943420221079765","DOIUrl":"https://doi.org/10.1177/10943420221079765","url":null,"abstract":"Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"370 - 387"},"PeriodicalIF":3.1,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45876634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}