
Latest publications: International Journal of High Performance Computing Applications

Corrigendum to ‘Unprecedented cloud resolution in a GPU-enabled full-physics atmospheric climate simulation on OLCF’s summit supercomputer’
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-07-01 | DOI: 10.1177/10943420221103014
M. Norman
Citations: 0
Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-06-23 | DOI: 10.1177/10943420231158616
Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis
We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.
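As a minimal illustration of the spectral accuracy that motivates spectral element methods (a NumPy toy on a periodic grid, not the authors’ Fortran solver), a Fourier pseudo-spectral derivative recovers cos(x) from sin(x) to near machine precision:

```python
import numpy as np

# Differentiate sin(x) on a periodic grid via FFT and compare
# to the exact derivative cos(x).
n = 64
x = 2 * np.pi * np.arange(n) / n
u = np.sin(x)
k = 1j * np.fft.fftfreq(n, d=1.0 / n)      # i * wavenumber
du = np.fft.ifft(k * np.fft.fft(u)).real   # spectral derivative
err = np.max(np.abs(du - np.cos(x)))       # near machine epsilon
```

For smooth periodic data the error decays faster than any power of the grid spacing, which is the accuracy-per-degree-of-freedom argument behind high-order methods on GPUs.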
Citations: 3
Accelerating physics simulations with tensor processing units: An inundation modeling example
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-06-03 | DOI: 10.1177/10943420221102873
R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen
Recent advancements in hardware accelerators such as Tensor Processing Units (TPUs) speed up computation time relative to Central Processing Units (CPUs) not only for machine learning but, as demonstrated here, also for scientific modeling and computer simulations. To study TPU hardware for distributed scientific computing, we solve partial differential equations (PDEs) for the physics simulation of fluids to model riverine floods. We demonstrate that TPUs achieve a two orders of magnitude speedup over CPUs. Running physics simulations on TPUs is publicly accessible via the Google Cloud Platform, and we release a Python interactive notebook version of the simulation.
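The flavor of PDE time stepping involved can be sketched with one explicit finite-difference step for a 2D diffusion-type equation in plain NumPy (illustrative only; the paper’s flood solver discretizes the shallow-water equations and runs on TPUs):

```python
import numpy as np

def step(h, nu=0.1, dt=0.1, dx=1.0):
    # Explicit update h <- h + dt * nu * laplacian(h), periodic boundaries.
    lap = (np.roll(h, 1, 0) + np.roll(h, -1, 0) +
           np.roll(h, 1, 1) + np.roll(h, -1, 1) - 4 * h) / dx**2
    return h + dt * nu * lap

h = np.zeros((32, 32))
h[16, 16] = 1.0            # initial pulse of "water"
for _ in range(100):
    h = step(h)            # pulse spreads; total volume is conserved
```

A TPU or GPU version would express the same stencil with accelerator-resident arrays (e.g., in TensorFlow or JAX), so that each step is a handful of fused elementwise operations.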
Citations: 6
Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-05-24 | DOI: 10.1177/10943420231177631
Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl
The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.
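The FP16/FP32-mixed arithmetic can be emulated in NumPy by rounding the inputs to FP16 and accumulating products in FP32, then comparing against an FP64 reference (a sketch of the precision model only, not of the submatrix method itself):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

# Tensor-core style: inputs rounded to FP16, products accumulated in FP32.
A16, B16 = A.astype(np.float16), B.astype(np.float16)
C_mixed = A16.astype(np.float32) @ B16.astype(np.float32)

C_ref = A @ B                                   # FP64 reference
rel_err = np.linalg.norm(C_mixed - C_ref) / np.linalg.norm(C_ref)
```

The relative error is dominated by the FP16 rounding of the inputs (about 5e-4), which is why a method whose accuracy tolerates that input rounding can ride the much higher FP16 tensor-core throughput.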
Citations: 4
Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-05-18 | DOI: 10.1177/10943420221107880
M. Kronbichler, D. Sashko, Peter Munch
This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional access hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit.
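The baseline being optimized is the textbook preconditioned CG iteration with a diagonal (Jacobi) preconditioner; a minimal NumPy version is sketched below. Note that this sketch keeps the matrix-vector product, vector updates, and inner products as separate passes over memory; the paper’s contribution is interleaving them so each vector entry is streamed from RAM only once per iteration.

```python
import numpy as np

def pcg(A, b, tol=1e-10, maxit=500):
    # Preconditioned conjugate gradients with the matrix diagonal
    # as a cheap preconditioner.
    x = np.zeros_like(b)
    Minv = 1.0 / np.diag(A)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p                      # matrix-vector product
        alpha = rz / (p @ Ap)
        x += alpha * p                  # vector updates ...
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z                  # ... and inner products
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD test problem: 1D Laplacian.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b)
```

In a matrix-free high-order code, the dense `A @ p` above is replaced by an on-the-fly operator evaluation, which is what makes the memory traffic of the three active vectors the dominant cost.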
Citations: 13
Performance analysis of relaxation Runge–Kutta methods
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-05-12 | DOI: 10.1177/10943420221085947
M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes
Recently, global and local relaxation Runge–Kutta methods have been developed for guaranteeing the conservation, dissipation, or other solution properties for general convex functionals whose dynamics are crucial for an ordinary differential equation solution. These novel time integration procedures apply to a wide range of problems that require dynamics-consistent and stable numerical methods. Applying a relaxation scheme involves solving scalar nonlinear algebraic equations to find the relaxation parameter. Even though root-finding may seem technically straightforward and computationally insignificant, we address the problem at scale as we solve full-scale industrial problems on a CPU-powered supercomputer and show its cost to be considerable. In particular, we apply the relaxation schemes in the context of the compressible Navier–Stokes equations and use them to enforce the correct entropy evolution. We use seven different algorithms to solve for the global and local relaxation parameters and analyze their strong scalability. As a result of this analysis, within the global relaxation scheme, we recommend Brent’s method for problems with a low polynomial degree and of small sizes, while secant proves to be the best choice for higher polynomial degree solutions and large problem sizes. For the local relaxation scheme, we recommend secant. Further, we compare the schemes’ performance using their most efficient implementations, where we look at their effect on the timestep size, overhead, and weak scalability. We show the global relaxation scheme to be always more expensive than the local approach, typically 1.1–1.5 times the cost. At the same time, we highlight scenarios where the global relaxation scheme might underperform due to its increased communication requirements. Finally, we present an analysis that sets expectations on the computational overhead anticipated based on the system properties.
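The scalar root-finding step can be sketched with a secant iteration on a toy convex functional η(u) = ||u||²/2, where the entropy-change estimate `e` is synthetic and chosen so the nontrivial root lands at γ = 0.9 (all values illustrative, not the compressible Navier–Stokes setting):

```python
import numpy as np

def secant(g, x0, x1, tol=1e-12, maxit=50):
    # Classic secant iteration for a scalar root of g.
    for _ in range(maxit):
        g0, g1 = g(x0), g(x1)
        if abs(g1) < tol:
            return x1
        x0, x1 = x1, x1 - g1 * (x1 - x0) / (g1 - g0)
    return x1

u = np.array([1.0, 2.0])          # current state
d = np.array([0.1, -0.2])         # Runge-Kutta update direction
eta = lambda v: 0.5 * (v @ v)     # convex functional (energy)

# Synthetic estimate of the functional change over the step,
# constructed so that r(gamma) = 0 at gamma = 0.9.
e = u @ d + 0.45 * (d @ d)
r = lambda gamma: eta(u + gamma * d) - eta(u) - gamma * e
gamma = secant(r, 0.8, 1.2)       # relaxation parameter near 1
```

In the global scheme this residual involves a reduction over the whole mesh, so each secant iteration costs a global communication, which is the scalability concern the paper quantifies.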
Citations: 2
Very fast finite element Poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-05-06 | DOI: 10.1177/10943420221084657
D. Ruda, S. Turek, D. Ribbrock, P. Zajác
Recently, accelerator hardware in the form of graphics cards including Tensor Cores, specialized for AI, has significantly gained importance in the domain of high-performance computing. For example, NVIDIA’s Tesla V100 promises a computing power of up to 125 TFLOP/s achieved by Tensor Cores, but only if half precision floating point format is used. Using numerical examples, we describe the difficulties and the discrepancy between theoretical and actual computing power that arise when such hardware is used for numerical simulations, that is, for solving partial differential equations with a matrix-based finite element method. If certain requirements, namely low condition numbers and many dense matrix operations, are met, the indicated high performance can be reached without an excessive loss of accuracy. A new method to solve linear systems arising from Poisson’s equation in 2D that meets these requirements, based on “prehandling” by means of hierarchical finite elements and an additional Schur complement approach, is presented and analyzed. We provide numerical results illustrating the computational performance of this method and compare it to a commonly used (geometric) multigrid solver on standard hardware. It turns out that we can exploit nearly the full computational power of Tensor Cores and achieve a significant speed-up compared to the standard methodology without losing accuracy.
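Why low condition numbers matter at half precision can be sketched by rounding a linear system to FP16 storage and comparing the resulting solution error for a well-conditioned versus an ill-conditioned SPD matrix (illustrative only; the paper’s “prehandling” reduces the condition number before any FP16 arithmetic is applied):

```python
import numpy as np

def spd_with_cond(n, cond, rng):
    # Random SPD matrix with a prescribed condition number.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    d = np.logspace(0, -np.log10(cond), n)
    return Q @ np.diag(d) @ Q.T

def fp16_storage_error(A, x_true):
    # Round the system to FP16 storage, solve in FP64,
    # and measure the relative solution error.
    b = A @ x_true
    A16 = A.astype(np.float16).astype(np.float64)
    b16 = b.astype(np.float16).astype(np.float64)
    x = np.linalg.solve(A16, b16)
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

rng = np.random.default_rng(0)
x_true = np.ones(50)
err_well = fp16_storage_error(spd_with_cond(50, 1e1, rng), x_true)
err_ill = fp16_storage_error(spd_with_cond(50, 1e4, rng), x_true)
```

The FP16 rounding perturbs the system by roughly 5e-4 in relative terms, and that perturbation is amplified by the condition number, which is why an unpreconditioned discrete Poisson matrix (condition number growing like h⁻²) is unusable at half precision without a transformation of this kind.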
Citations: 4
Performance portability in a real world application: PHAST applied to Caffe
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-05-01 | DOI: 10.1177/10943420221077107
Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé
This work covers the application of the PHAST Library, a hardware-agnostic programming library, to a real-world application: the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another to run on the GPU side. With PHAST, we aim to develop a single-source implementation capable of running efficiently on both CPU and GPU. In this paper, we start by carrying out a performance analysis of a basic Caffe implementation using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by a few ‘heavy’ layers. In refining the inefficient parts of this version, we pursue two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translate into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an equivalent amount of time as native code on CPU and GPU. Furthermore, PHAST achieves speedups of 51% and 49% with the CIFAR-10 database against native code on CPU and GPU, respectively. These results provide a new horizon for software development in the upcoming heterogeneous computing era.
Citations: 1
Performance portable ice-sheet modeling with MALI
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-04-08 | DOI: 10.1177/10943420231183688
Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price
High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.
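A toy mean-shift changepoint detector in the spirit of the automated performance testing framework (the `changepoint` helper and the timing data are illustrative, not the paper’s implementation): it finds the split of a runtime series that best explains a jump in mean, flagging a likely regression.

```python
import numpy as np

def changepoint(t):
    # Find the split index minimizing within-segment squared deviation,
    # i.e. the single mean-shift that best explains the series.
    t = np.asarray(t, dtype=float)
    best_k, best_cost = None, np.inf
    for k in range(2, len(t) - 1):
        a, b = t[:k], t[k:]
        cost = ((a - a.mean())**2).sum() + ((b - b.mean())**2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Nightly runtimes (seconds); a regression lands at index 5.
times = [10.1, 10.0, 10.2, 9.9, 10.1, 11.6, 11.5, 11.7, 11.4]
k = changepoint(times)
```

A production framework would add a significance test so that ordinary run-to-run noise does not trigger a report, but the split-and-compare structure is the same.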
Citations: 2
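The MALI abstract mentions an automated performance-testing framework built on changepoint detection. The idea — flag the run at which a nightly timing series shifts to a new mean — can be illustrated with a toy mean-shift detector; this is not MALI's actual framework, and the window size and threshold here are invented:

```python
from statistics import mean, stdev

def detect_changepoint(times, window=5, threshold=3.0):
    """Return the first index where the mean of the next `window` runs
    drifts more than `threshold` standard deviations from the preceding
    window, or None if the series looks stable. A toy check, not a
    production changepoint algorithm."""
    for i in range(window, len(times) - window + 1):
        before = times[i - window:i]
        after = times[i:i + window]
        sigma = stdev(before) or 1e-12  # guard against zero variance
        if abs(mean(after) - mean(before)) > threshold * sigma:
            return i
    return None

# Nightly runtimes (s): stable around 10.0, then a regression to ~12.0.
history = [10.0, 10.1, 9.9, 10.0, 10.1, 12.1, 12.0, 11.9, 12.2, 12.0]
print(detect_changepoint(history))  # flags the shift at index 5
```

A real framework would also distinguish regressions from improvements (sign of the shift) and archive the flagged commit range, as the paper describes at a high level.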
AI4IO: A suite of AI-based tools for IO-aware scheduling AI4IO:一套用于IO感知调度的基于AI的工具
IF 3.1 3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2022-04-03 DOI: 10.1177/10943420221079765
Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.
{"title":"AI4IO: A suite of AI-based tools for IO-aware scheduling","authors":"Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer","doi":"10.1177/10943420221079765","DOIUrl":"https://doi.org/10.1177/10943420221079765","url":null,"abstract":"Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether from bursts of IO demand or parallel file systems (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools using AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it. CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves the workload makespan up to 6.4%, which can account for more than 18,000 node-h of saved resources per week on a production cluster in our large-scale workload.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"370 - 387"},"PeriodicalIF":3.1,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45876634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
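The AI4IO abstract describes PRIONN predicting IO contention so the scheduler can avoid launching IO-heavy jobs into it. A toy greedy dispatcher makes the idea concrete — jobs whose predicted IO demand does not fit under the forecast parallel-file-system headroom are deferred. The job shapes, forecast values, and capacity below are invented for illustration and do not reflect AI4IO's interfaces:

```python
from collections import deque

def schedule(jobs, io_forecast, io_capacity=100):
    """Greedy IO-aware dispatch: at each time step, launch queued jobs in
    FIFO order while their predicted IO demand fits under the remaining
    PFS bandwidth for that step; otherwise defer. A toy model of the
    PRIONN idea, not its implementation."""
    queue = deque(jobs)                 # each job: (name, predicted_io)
    order = []                          # (step, name) dispatch record
    for step, background_io in enumerate(io_forecast):
        headroom = io_capacity - background_io
        for _ in range(len(queue)):
            name, io = queue[0]
            if io <= headroom:
                order.append((step, name))
                headroom -= io
                queue.popleft()
            else:
                break                   # keep FIFO order; wait for headroom
    return order

jobs = [("sim_A", 60), ("sim_B", 30), ("post_C", 10)]
forecast = [70, 20, 0]                  # predicted background IO per step
print(schedule(jobs, forecast))         # → [(1, 'sim_A'), (2, 'sim_B'), (2, 'post_C')]
```

Note how `sim_A` is held back at step 0, when the forecast burst leaves too little headroom — the behavior the paper credits for the makespan improvement, here reduced to its simplest form.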