International Journal of High Performance Computing Applications: Latest Articles


A massively parallel time-domain coupled electrodynamics–micromagnetics solver

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-03-23 DOI: 10.1177/10943420211057906

Z. Yao, R. Jambunathan, Yadong Zeng, A. Nonaka

We present a high-performance coupled electrodynamics–micromagnetics solver for full physical modeling of signals in microelectronic circuitry. The overall strategy couples a finite-difference time-domain approach for Maxwell’s equations to a magnetization model described by the Landau–Lifshitz–Gilbert equation. The algorithm is implemented in the Exascale Computing Project software framework, AMReX, which provides effective scalability on manycore and GPU-based supercomputing architectures. Furthermore, the code leverages ongoing developments of the Exascale Application Code, WarpX, which is primarily being developed for plasma wakefield accelerator modeling. Our temporal coupling scheme provides second-order accuracy in space and time by combining the integration steps for the magnetic field and magnetization into an iterative sub-step that includes a trapezoidal temporal discretization for the magnetization. The performance of the algorithm is demonstrated by the excellent scaling results on NERSC multicore and GPU systems, with a significant (59×) speedup on the GPU using a node-by-node comparison. We demonstrate the utility of our code by performing simulations of an electromagnetic waveguide and a magnetically tunable filter.
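The trapezoidal discretization for the magnetization mentioned in the abstract can be illustrated with a much-reduced sketch. It assumes an undamped precession equation dM/dt = -gamma (M x H) with a fixed field H, dimensionless units, and a plain fixed-point iteration for the implicit step; the function names and parameter values are hypothetical and this is not the authors' coupled Maxwell–LLG scheme.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def trapezoidal_step(m, h, gamma, dt, iters=30):
    """Solve m_new = m - dt*gamma*((m + m_new)/2 x h) by fixed-point iteration.

    Assumption: dt*gamma*|h|/2 < 1, so the iteration is a contraction.
    """
    m_new = m
    for _ in range(iters):
        mid = tuple(0.5 * (a + b) for a, b in zip(m, m_new))
        torque = cross(mid, h)
        m_new = tuple(a - dt * gamma * t for a, t in zip(m, torque))
    return m_new

def precess(m0, h, gamma=1.0, dt=0.1, steps=100):
    """Advance the magnetization through `steps` trapezoidal time steps."""
    m = m0
    for _ in range(steps):
        m = trapezoidal_step(m, h, gamma, dt)
    return m

def norm(v):
    return math.sqrt(sum(x * x for x in v))
```

Because the update direction is perpendicular to the average of the old and new magnetization, the converged step keeps |M| constant, which is one reason time-centered discretizations are attractive for this kind of equation.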


Volume: 36 1 · Pages: 167–181 · Publication type: Journal Article

Citations: 3

A survey of numerical linear algebra methods utilizing mixed-precision arithmetic

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-03-19 DOI: 10.1177/10943420211003313

A. Abdelfattah, H. Anzt, E. Boman, E. Carson, T. Cojean, J. Dongarra, Alyson Fox, M. Gates, N. Higham, X. Li, J. Loe, P. Luszczek, S. Pranesh, S. Rajamanickam, T. Ribizel, Barry Smith, K. Swirydowicz, Stephen J. Thomas, S. Tomov, Y. Tsai, U. Yang

The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems.
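One widely known mixed-precision technique is iterative refinement: solve in low precision, then correct the result using residuals computed in higher precision. The sketch below is only an illustration under stated assumptions: binary32 is emulated by rounding through the `struct` module, the helper names are hypothetical, and the elimination is repeated for every correction for brevity, whereas a practical code would factor once and reuse the factors.

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_f32(a, b):
    """Gaussian elimination with partial pivoting; every intermediate is rounded to binary32."""
    n = len(b)
    m = [[to_f32(v) for v in row] + [to_f32(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            f = to_f32(m[r][k] / m[k][k])
            for c in range(k, n + 1):
                m[r][c] = to_f32(m[r][c] - to_f32(f * m[k][c]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n]
        for j in range(i + 1, n):
            s = to_f32(s - to_f32(m[i][j] * x[j]))
        x[i] = to_f32(s / m[i][i])
    return x

def residual_f64(a, b, x):
    """Residual b - A x evaluated in full double precision."""
    return [rhs - sum(aij * xj for aij, xj in zip(row, x)) for row, rhs in zip(a, b)]

def refine(a, b, steps=5):
    """Low-precision solve followed by double-precision residual corrections."""
    x = solve_f32(a, b)
    for _ in range(steps):
        r = residual_f64(a, b, x)
        d = solve_f32(a, r)          # correction computed in low precision
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For a well-conditioned system, a few corrections are typically enough to recover double-precision accuracy even though all the expensive elimination work runs in the cheaper format.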


Citations: 57

Data-driven global weather predictions at high resolutions

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-03-09 DOI: 10.1177/10943420211039818

John Taylor, P. Larraondo, B. D. de Supinski

Society has benefited enormously from the continuous advancement in numerical weather prediction that has occurred over many decades driven by a combination of outstanding scientific, computational and technological breakthroughs. Here, we demonstrate that data-driven methods are now positioned to contribute to the next wave of major advances in atmospheric science. We show that data-driven models can predict important meteorological quantities of interest to society such as global high resolution precipitation fields (0.25°) and can deliver accurate forecasts of the future state of the atmosphere without prior knowledge of the laws of physics and chemistry. We also show how these data-driven methods can be scaled to run on supercomputers with up to 1024 modern graphics processing units and beyond resulting in rapid training of data-driven models, thus supporting a cycle of rapid research and innovation. Taken together, these two results illustrate the significant potential of data-driven methods to advance atmospheric science and operational weather forecasting.


Citations: 7

Accelerated execution via eager-release of dependencies in task-based workflows

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-03-03 DOI: 10.1177/1094342021997558

Hatem Elshazly, F. Lordan, J. Ejarque, R. Badia

Task-based programming models offer a flexible way to express the unstructured parallelism patterns of today's complex applications. This expressive capability is required to achieve the maximum possible performance for applications that are executed in distributed execution platforms. In current task-based workflows, tasks are launched for execution when their data dependencies are satisfied. However, even though the data dependencies of a certain task might have been already produced, the execution of this task will be delayed until its predecessor tasks completely finish their execution. As a consequence of this approach of releasing dependencies, the amount of parallelism inherent in applications is limited and performance improvement opportunities are wasted. To mitigate this limitation, we propose an eager approach for releasing data dependencies. Following this approach, the execution of tasks will not be delayed until their predecessor tasks completely finish their execution; instead, tasks will be launched for execution as soon as their data requirements are available. Hence, more parallelism is exposed and applications can achieve higher levels of performance by overlapping the execution of tasks. Towards achieving this goal, in this paper we propose applying two changes to task-based workflow systems. First, we modify the dependency relationships of tasks so that they are specified not only in terms of predecessor and successor tasks but also in terms of the data that caused these dependencies. Second, we trigger the release of dependencies as soon as a predecessor task generates the output data instead of having to wait until the end of the predecessor execution to release all of its dependencies. We realize this proposal using PyCOMPSs: a task-based programming model for parallelizing Python applications. Our experiments show that using an eager approach for releasing dependencies achieves more than 50% performance improvement in the total execution time as compared to the default approach of releasing dependencies.
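The idea of tying a dependency to a datum rather than to the completion of the producing task can be sketched with plain Python threads. This is a toy illustration, not the PyCOMPSs API: the `Datum` class, the task functions, and the event used to make the overlap observable are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Datum:
    """A single output value that can be published before its producer finishes."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def publish(self, value):
        self._value = value
        self._ready.set()            # eager release: consumers may start now

    def get(self, timeout=5.0):
        if not self._ready.wait(timeout):
            raise TimeoutError("datum was never published")
        return self._value

def predecessor(out, successor_started, log):
    log.append("predecessor: releasing output")
    out.publish(21)
    # Stand-in for the predecessor's remaining work. Waiting on the event
    # only makes the overlap deterministic for this demonstration.
    overlapped = successor_started.wait(timeout=5.0)
    log.append("predecessor: finished")
    return "overlapped" if overlapped else "not overlapped"

def successor(inp, successor_started, log):
    value = inp.get()                # depends on the datum, not on task completion
    log.append("successor: started")
    successor_started.set()
    return 2 * value

def run_workflow():
    log, out, started = [], Datum(), threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        pred = pool.submit(predecessor, out, started, log)
        succ = pool.submit(successor, out, started, log)
        return pred.result(), succ.result(), log
```

Under a completion-based release, the successor could not begin until the predecessor returned; here it starts while the predecessor is still running, which is the overlap the abstract describes.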


Volume: 35 1 · Pages: 325–343 · Publication type: Journal Article

Citations: 2

Task-parallel in situ temporal compression of large-scale computational fluid dynamics data

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-03-02 DOI: 10.1177/10943420221085000

Heather Pacella, Alec M. Dunton, A. Doostan, G. Iaccarino

Present-day computational fluid dynamics (CFD) simulations generate considerable amounts of data, sometimes on the order of TB/s. Often, a significant fraction of this data is discarded because current storage systems are unable to keep pace. To address this, data compression algorithms can be applied to data arrays containing flow quantities of interest (QoIs) to reduce the overall required storage. The matrix column interpolative decomposition (ID) can be implemented as a type of lossy compression for data matrices that factors the original data matrix into a product of two smaller factor matrices. One of these matrices consists of a subset of the columns of the original data matrix, while the other is a coefficient matrix which approximates the original data matrix columns as linear combinations of the selected columns. Motivating this work is the observation that the structure of ID algorithms makes them well suited for the asynchronous nature of task-based parallelism; they can operate independently on subdomains of the system of interest and, as a result, provide varied levels of compression. Using the task-based Legion programming model, a single-pass ID algorithm (SPID) for CFD applications is implemented. Performance studies, scalability, and the accuracy of the compression algorithm are presented for a benchmark analytical Taylor-Green vortex problem, as well as large-scale implementations of both low and high Reynolds number (Re) compressible Taylor-Green vortices using a high-order Navier-Stokes solver. In the case of the analytical solution, the resulting compressed solution was rank-one, with error on the order of machine precision. For the low-Re vortex, compression factors between 1000 and 10,000 were achieved for errors in the range 10⁻²–10⁻³. Similar error values were seen for the high-Re vortex, this time with compression factors between 100 and 1000. Moreover, strong and weak scaling results demonstrate that introducing SPID to solvers leads to negligible increases in runtime.
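The rank-one result for the analytical case can be reproduced in a small sketch of a column ID restricted to a single selected column. It assumes the two-dimensional incompressible Taylor-Green velocity component u = sin(x) cos(y) exp(-2 nu t) sampled on a small grid, with each time snapshot stored as one column; the grid size, viscosity, time step, and function names are hypothetical, and this is not the single-pass SPID algorithm.

```python
import math

def taylor_green_snapshots(n=8, steps=5, nu=0.1, dt=0.5):
    """Return a list of columns; column s is the flattened field at time s*dt."""
    pts = [2.0 * math.pi * i / n for i in range(n)]
    cols = []
    for s in range(steps):
        decay = math.exp(-2.0 * nu * s * dt)
        cols.append([math.sin(x) * math.cos(y) * decay for x in pts for y in pts])
    return cols

def rank_one_column_id(cols):
    """Pick the column with the largest norm and fit every column as a multiple of it."""
    sq_norms = [sum(v * v for v in c) for c in cols]
    p = max(range(len(cols)), key=sq_norms.__getitem__)
    skeleton = cols[p]
    # Least-squares coefficient of each column against the selected column.
    coeffs = [sum(a * b for a, b in zip(c, skeleton)) / sq_norms[p] for c in cols]
    return p, coeffs

def max_error(cols, p, coeffs):
    """Largest entrywise difference between the data and its rank-one reconstruction."""
    return max(abs(v - c * s)
               for col, c in zip(cols, coeffs)
               for v, s in zip(col, cols[p]))
```

Storing one column plus one coefficient per snapshot replaces the full space-by-time matrix, and because every snapshot here is an exact scalar multiple of the first, the reconstruction error stays at round-off level.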


Volume: 36 1 · Pages: 388–418 · Publication type: Journal Article

Citations: 7

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-02-08 DOI: 10.1177/1094342021990433

Tommaso Benacchio, Luca Bonaventura, Mirco Altenbernd, C. Cantwell, P. Düben, M. Gillard, L. Giraud, Dominik Göddeke, E. Raffin, K. Teranishi, N. Wedi

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
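In-memory checkpointing for an iterative linear solver can be sketched as follows. The example assumes a Jacobi iteration on a small diagonally dominant system, a simulated fault that overwrites the iterate with NaN values, and NaN detection as the only fault detector; all names and parameters are hypothetical and none of this is taken from the surveyed systems.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep for the linear system a x = b."""
    n = len(b)
    return [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)]

def solve_with_checkpoints(a, b, iters=60, every=10, fault_at=None):
    """Jacobi iteration that keeps an in-memory copy of the state every `every` steps."""
    x = [0.0] * len(b)
    checkpoint = (0, list(x))
    k, restarts, faulted = 0, 0, False
    while k < iters:
        if k == fault_at and not faulted:
            faulted = True
            x = [float("nan")] * len(b)      # simulated memory corruption
        if any(v != v for v in x):           # NaN check acts as the fault detector
            k, saved = checkpoint            # roll back to the last good state
            x = list(saved)
            restarts += 1
            continue
        x = jacobi_step(a, b, x)
        k += 1
        if k % every == 0:
            checkpoint = (k, list(x))
    return x, restarts
```

The cost of a fault in this sketch is bounded by the checkpoint interval: at most `every` iterations are recomputed, and the final iterate matches the fault-free run.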


Volume: 35 1 · Pages: 285–311 · Publication type: Journal Article

Citations: 10

High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-01-01 DOI: 10.1007/978-3-030-78713-4

Citations: 0

High Performance Computing: 7th Latin American Conference, CARLA 2020, Cuenca, Ecuador, September 2–4, 2020, Revised Selected Papers

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2021-01-01 DOI: 10.1007/978-3-030-68035-0

Citations: 0

Point-block incomplete LU preconditioning with asynchronous iterations on GPU for multiphysics problems

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2020-12-28 DOI: 10.1177/1094342020981153

Wenpeng Ma, X. Cai

Point-block matrices arise naturally in multiphysics problems when all variables associated with a mesh point are ordered together, and are different from the general block matrices since the sizes of the blocks are so small one can often invert some of the diagonal blocks explicitly. Motivated by the recent works of Chow and Patel and Chow et al., we propose an efficient incomplete LU (ILU) preconditioner for point-block matrices targeting applications on GPU. The construction of the preconditioner involves two critical steps: (1) the initial guessing of values for the lower and upper triangular matrices; and (2) several sweeps of asynchronous updating of the triangular matrices. Three representative problems are studied to show the advantage of the proposed point-block approach over the standard point-wise approach in terms of the number of GMRES iterations and also the total compute time. Moreover, we compare the proposed algorithm with the level-scheduling based parallel algorithm employed in NVIDIA’s cuSPARSE library as well as the serial method implemented in Intel MKL library, and the experiments show that a 2×–5× speedup can be achieved over the block-based ILU(p) factorizations from the cuSPARSE library.
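The two construction steps (an initial guess followed by sweeps that update every entry independently) can be sketched in the scalar, point-wise setting of the fine-grained ILU iteration of Chow and Patel. This is a serial sketch under stated assumptions: each sweep reads only values from the previous sweep to mimic independent updates, the matrix is a dictionary of nonzeros that includes every diagonal entry, and the function names are hypothetical. The point-block variant described in the abstract would replace the scalars with small dense blocks.

```python
def ilu_sweeps(a, sweeps=10):
    """Fixed-point ILU(0) sweeps on the sparsity pattern of `a`.

    `a` maps (i, j) to a nonzero value. Returns (l, u), where l holds the
    strictly lower part (unit diagonal implied) and u holds the upper part.
    """
    # Step 1: initial guess taken from the matrix itself.
    l = {(i, j): v / a[(j, j)] for (i, j), v in a.items() if i > j}
    u = {(i, j): v for (i, j), v in a.items() if i <= j}
    # Step 2: sweeps in which every entry is updated from the previous iterate.
    for _ in range(sweeps):
        new_l, new_u = {}, {}
        for (i, j), v in a.items():
            s = sum(l.get((i, k), 0.0) * u.get((k, j), 0.0) for k in range(min(i, j)))
            if i > j:
                new_l[(i, j)] = (v - s) / u[(j, j)]
            else:
                new_u[(i, j)] = v - s
        l, u = new_l, new_u
    return l, u

def pattern_residual(a, l, u):
    """Largest |(L U)_ij - a_ij| over the sparsity pattern of a."""
    worst = 0.0
    for (i, j), v in a.items():
        s = sum((1.0 if k == i else l.get((i, k), 0.0)) * u.get((k, j), 0.0)
                for k in range(min(i, j) + 1))
        worst = max(worst, abs(s - v))
    return worst

def tridiagonal(n, diag=2.0, off=-1.0):
    """Build a tridiagonal test matrix as a dictionary of nonzeros."""
    a = {(i, i): diag for i in range(n)}
    for i in range(n - 1):
        a[(i, i + 1)] = off
        a[(i + 1, i)] = off
    return a
```

Since no entry update depends on another update within the same sweep, all entries could be assigned to separate GPU threads, which is what makes this formulation attractive compared with level-scheduled factorizations.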


Volume: 35 1 · Pages: 121–135 · Publication type: Journal Article

Citations: 1

Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications

Pub Date : 2020-12-11 DOI: 10.1177/10943420211016525

M. Holzer, Martin Bauer, H. Köstler, U. Rüde

A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Meta-programming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behavior. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario—a three-dimensional rising air bubble in water.
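A roofline analysis bounds the attainable performance of a kernel by the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below shows only that bound; the function names are hypothetical and the numbers in the usage note are made up for illustration, not measurements from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

def is_memory_bound(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """True when the bandwidth term, not the compute peak, limits performance."""
    return bandwidth_gb_s * flops_per_byte < peak_gflops
```

For example, with an assumed peak of 1000 GFLOP/s and 100 GB/s of bandwidth, a kernel performing 2 floating-point operations per byte is limited to 200 GFLOP/s. Stencil-like kernels usually sit in this memory-bound regime, which is why fusing compute kernels to reduce memory traffic matters for their efficiency.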


Volume: 35 1 · Pages: 413–427 · Publication type: Journal Article

Citations: 14
