Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers

IF 2.1 · Region 4, Computer Science · Q2, COMPUTER SCIENCE, THEORY & METHODS · Parallel Computing · Pub Date: 2022-09-01 · Epub Date: 2022-05-31 · DOI: 10.1016/j.parco.2022.102940

Daniel Bielich , Julien Langou , Stephen Thomas , Kasia Świrydowicz , Ichitaro Yamazaki , Erik G. Boman

Parallel Computing, volume 112, Article 102940. https://www.sciencedirect.com/science/article/pii/S0167819122000394

Citations: 9

Abstract

The parallel strong scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and thereby improve the parallel strong scaling of iterative solvers for nonsymmetric matrices, such as GMRES and the Krylov–Schur iterative methods. In the Arnoldi context, the $QR$ factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.
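The count of three reductions per iteration for CGS2 can be illustrated with a minimal serial sketch. This is not the authors' code and it does not implement DCGS2. The names `dot`, `cgs2`, and the `reductions` counter are hypothetical and introduced only for illustration. The sketch assumes that, in a distributed-memory setting, each batch of inner products against the current basis, and each norm, would cost one global reduction (for example one all-reduce). It also assumes the input columns are linearly independent.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cgs2(columns):
    """Orthonormalize `columns` one at a time with classical Gram-Schmidt
    plus one reorthogonalization pass (CGS2).

    Returns (Q, reductions). `reductions` counts the points where a
    distributed implementation would need a global reduction
    (illustrative bookkeeping only, an assumption of this sketch).
    """
    Q = []
    reductions = 0
    for a in columns:
        w = list(a)
        # Two projection passes: the first orthogonalizes, the second
        # reorthogonalizes to recover orthogonality lost to rounding.
        for _ in range(2):
            if Q:
                # Classical GS: all coefficients come from the same w,
                # so they can be batched into a single reduction.
                coeffs = [dot(q, w) for q in Q]
                reductions += 1
                for c, q in zip(coeffs, Q):
                    w = [wi - c * qi for wi, qi in zip(w, q)]
        # The norm needs its own reduction after the second pass.
        norm = math.sqrt(dot(w, w))
        reductions += 1
        Q.append([wi / norm for wi in w])
    return Q, reductions
```

In this sketch every column after the first costs three counted reductions (projection, reprojection, norm), while the first column only needs the norm. Per the abstract, DCGS2 rearranges the order of operations so that these reductions are grouped into one per iteration; how that is done is described in the paper and is not reproduced here.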


