V. Lefèvre, N. Louvet, J. Muller, Joris Picot, L. Rideau
We consider the computation of the Euclidean (or L2) norm of an n-dimensional vector in floating-point arithmetic. We review the classical solutions used to avoid spurious overflow or underflow and/or to obtain very accurate results. We modify a recently published algorithm (that uses double-word arithmetic) to allow for a very accurate solution, free of spurious overflows and underflows. To that purpose, we use a double-word square-root algorithm of which we provide a tight error analysis. The returned L2 norm will be within very slightly more than 0.5 ulp from the exact result, which means that we will almost always provide correct rounding.
{"title":"Accurate Calculation of Euclidean Norms Using Double-word Arithmetic","authors":"V. Lefèvre, N. Louvet, J. Muller, Joris Picot, L. Rideau","doi":"10.1145/3568672","DOIUrl":"https://doi.org/10.1145/3568672","url":null,"abstract":"We consider the computation of the Euclidean (or L2) norm of an n-dimensional vector in floating-point arithmetic. We review the classical solutions used to avoid spurious overflow or underflow and/or to obtain very accurate results. We modify a recently published algorithm (that uses double-word arithmetic) to allow for a very accurate solution, free of spurious overflows and underflows. To that purpose, we use a double-word square-root algorithm of which we provide a tight error analysis. The returned L2 norm will be within very slightly more than 0.5 ulp from the exact result, which means that we will almost always provide correct rounding.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"49 1","pages":"1 - 34"},"PeriodicalIF":2.7,"publicationDate":"2022-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46100009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IFISS is an established MATLAB finite element software package for studying strategies for solving partial differential equations (PDEs). IFISS3D is a new add-on toolbox that extends IFISS capabilities for elliptic PDEs from two to three space dimensions. The open-source MATLAB framework provides a computational laboratory for experimentation and exploration of finite element approximation and error estimation, as well as iterative solvers. The package is designed to be useful as a teaching tool for instructors and students who want to learn about state-of-the-art finite element methodology. It will also be useful for researchers as a source of reproducible test matrices of arbitrarily large dimension.
{"title":"IFISS3D: A Computational Laboratory for Investigating Finite Element Approximation in Three Dimensions","authors":"Georgios Papanikos, C. Powell, D. Silvester","doi":"10.1145/3604934","DOIUrl":"https://doi.org/10.1145/3604934","url":null,"abstract":"IFISS is an established MATLAB finite element software package for studying strategies for solving partial differential equations (PDEs). IFISS3D is a new add-on toolbox that extends IFISS capabilities for elliptic PDEs from two to three space dimensions. The open-source MATLAB framework provides a computational laboratory for experimentation and exploration of finite element approximation and error estimation, as well as iterative solvers. The package is designed to be useful as a teaching tool for instructors and students who want to learn about state-of-the-art finite element methodology. It will also be useful for researchers as a source of reproducible test matrices of arbitrarily large dimension.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"49 1","pages":"1 - 14"},"PeriodicalIF":2.7,"publicationDate":"2022-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47090209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.
{"title":"Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado","authors":"E. Phipps, R. Pawlowski, C. Trott","doi":"10.1145/3560262","DOIUrl":"https://doi.org/10.1145/3560262","url":null,"abstract":"Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"48 1","pages":"1 - 29"},"PeriodicalIF":2.7,"publicationDate":"2022-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42830984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a correction and an improvement to Algorithm 1010 [A. Orellana and C. De Michele 2020].
本文对算法1010 [a]进行了修正和改进。Orellana and C. De Michele 2020]。
{"title":"Remark on Algorithm 1010: Boosting Efficiency in Solving Quartic Equations with No Compromise in Accuracy","authors":"C. De Michele","doi":"10.1145/3564270","DOIUrl":"https://doi.org/10.1145/3564270","url":null,"abstract":"We present a correction and an improvement to Algorithm 1010 [A. Orellana and C. De Michele 2020].","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"48 1","pages":"1 - 3"},"PeriodicalIF":2.7,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64059873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Version 5.99 of the empirical Gramian framework – emgr – completes a development cycle which focused on parametric model order reduction of gas network models while preserving compatibility to the previous development for the application of combined state and parameter reduction for neuroscience network models. Second, new features concerning empirical Gramian types, perturbation design, and trajectory post-processing, as well as a Python version in addition to the default MATLAB / Octave implementation, have been added. This work summarizes these changes, particularly since emgr version 5.4, see Himpe, 2018 [Algorithms 11(7): 91], and gives recent as well as future applications, such as parameter identification in systems biology, based on the current feature set.
{"title":"emgr – EMpirical GRamian Framework Version 5.99","authors":"Christian Himpe","doi":"10.1145/3609860","DOIUrl":"https://doi.org/10.1145/3609860","url":null,"abstract":"Version 5.99 of the empirical Gramian framework – emgr – completes a development cycle which focused on parametric model order reduction of gas network models while preserving compatibility to the previous development for the application of combined state and parameter reduction for neuroscience network models. Second, new features concerning empirical Gramian types, perturbation design, and trajectory post-processing, as well as a Python version in addition to the default MATLAB / Octave implementation, have been added. This work summarizes these changes, particularly since emgr version 5.4, see Himpe, 2018 [Algorithms 11(7): 91], and gives recent as well as future applications, such as parameter identification in systems biology, based on the current feature set.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"49 1","pages":"1 - 8"},"PeriodicalIF":2.7,"publicationDate":"2022-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64082663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article presents a fast SIMD Hilbert space-filling curve generator, which supports a new cache-oblivious blocking-scheme technique applied to the out-of-place transposition of general matrices. Matrix operations found in high performance computing libraries are usually parameterized based on host microprocessor specifications to minimize data movement within the different levels of memory hierarchy. The performance of cache-oblivious algorithms does not rely on such parameterizations. This type of algorithm provides an elegant and portable solution to address the lack of standardization in modern-day processors. Our solution consists in an iterative blocking scheme that takes advantage of the locality-preserving properties of Hilbert space-filling curves to minimize data movement in any memory hierarchy. This scheme traverses the input matrix, in O(nm) time and space, improving the behavior of matrix algorithms that inherently present poor memory locality. The application of this technique to the problem of out-of-place matrix transposition achieved competitive results when compared to state-of-the-art approaches. The performance of our solution surpassed Intel MKL version after employing standard software prefetching techniques.
{"title":"Cache-oblivious Hilbert Curve-based Blocking Scheme for Matrix Transposition","authors":"J. N. F. Alves, L. Russo, Alexandre P. Francisco","doi":"10.1145/3555353","DOIUrl":"https://doi.org/10.1145/3555353","url":null,"abstract":"This article presents a fast SIMD Hilbert space-filling curve generator, which supports a new cache-oblivious blocking-scheme technique applied to the out-of-place transposition of general matrices. Matrix operations found in high performance computing libraries are usually parameterized based on host microprocessor specifications to minimize data movement within the different levels of memory hierarchy. The performance of cache-oblivious algorithms does not rely on such parameterizations. This type of algorithm provides an elegant and portable solution to address the lack of standardization in modern-day processors. Our solution consists in an iterative blocking scheme that takes advantage of the locality-preserving properties of Hilbert space-filling curves to minimize data movement in any memory hierarchy. This scheme traverses the input matrix, in O(nm) time and space, improving the behavior of matrix algorithms that inherently present poor memory locality. The application of this technique to the problem of out-of-place matrix transposition achieved competitive results when compared to state-of-the-art approaches. The performance of our solution surpassed Intel MKL version after employing standard software prefetching techniques.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"48 1","pages":"1 - 28"},"PeriodicalIF":2.7,"publicationDate":"2022-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43226486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Brust, O. Burdakov, Jennifer B. Erway, Roummel F. Marcia
We present a MATLAB implementation of the symmetric rank-one (SC-SR1) method that solves trust-region subproblems when a limited-memory symmetric rank-one (L-SR1) matrix is used in place of the true Hessian matrix, which can be used for large-scale optimization. The method takes advantage of two shape-changing norms [7, 9] to decompose the trust-region subproblem into two separate problems. Using one of the proposed norms, the resulting subproblems have closed-form solutions. Meanwhile, using the other proposed norm, one of the resulting subproblems has a closed-form solution while the other is easily solvable using techniques that exploit the structure of L-SR1 matrices. Numerical results suggest that the SC-SR1 method is able to solve trust-region subproblems to high accuracy even in the so-called “hard case”. When integrated into a trust-region algorithm, extensive numerical experiments suggest that the proposed algorithms perform well, when compared with widely used solvers, such as truncated CG.
{"title":"Algorithm xxx: SC-SR1: MATLAB Software for Limited-Memory SR1 Trust-Region Methods","authors":"J. Brust, O. Burdakov, Jennifer B. Erway, Roummel F. Marcia","doi":"10.1145/3550269","DOIUrl":"https://doi.org/10.1145/3550269","url":null,"abstract":"We present a MATLAB implementation of the symmetric rank-one (SC-SR1) method that solves trust-region subproblems when a limited-memory symmetric rank-one (L-SR1) matrix is used in place of the true Hessian matrix, which can be used for large-scale optimization. The method takes advantage of two shape-changing norms [7, 9] to decompose the trust-region subproblem into two separate problems. Using one of the proposed norms, the resulting subproblems have closed-form solutions. Meanwhile, using the other proposed norm, one of the resulting subproblems has a closed-form solution while the other is easily solvable using techniques that exploit the structure of L-SR1 matrices. Numerical results suggest that the SC-SR1 method is able to solve trust-region subproblems to high accuracy even in the so-called “hard case”. When integrated into a trust-region algorithm, extensive numerical experiments suggest that the proposed algorithms perform well, when compared with widely used solvers, such as truncated CG.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":" ","pages":""},"PeriodicalIF":2.7,"publicationDate":"2022-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41802660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The hp-adaptive finite element method—where one independently chooses the mesh size (h) and polynomial degree (p) to be used on each cell—has long been known to have better theoretical convergence properties than either h- or p-adaptive methods alone. However, it is not widely used, owing at least in part to the difficulty of the underlying algorithms and the lack of widely usable implementations. This is particularly true when used with continuous finite elements. Herein, we discuss algorithms that are necessary for a comprehensive and generic implementation of hp-adaptive finite element methods on distributed-memory, parallel machines. In particular, we will present a multistage algorithm for the unique enumeration of degrees of freedom suitable for continuous finite element spaces, describe considerations for weighted load balancing, and discuss the transfer of variable size data between processes. We illustrate the performance of our algorithms with numerical examples and demonstrate that they scale reasonably up to at least 16,384 message passage interface processes. We provide a reference implementation of our algorithms as part of the open source library deal.II.
{"title":"Algorithms for Parallel Generic hp-Adaptive Finite Element Software","authors":"M. Fehling, W. Bangerth","doi":"10.1145/3603372","DOIUrl":"https://doi.org/10.1145/3603372","url":null,"abstract":"The hp-adaptive finite element method—where one independently chooses the mesh size (h) and polynomial degree (p) to be used on each cell—has long been known to have better theoretical convergence properties than either h- or p-adaptive methods alone. However, it is not widely used, owing at least in part to the difficulty of the underlying algorithms and the lack of widely usable implementations. This is particularly true when used with continuous finite elements. Herein, we discuss algorithms that are necessary for a comprehensive and generic implementation of hp-adaptive finite element methods on distributed-memory, parallel machines. In particular, we will present a multistage algorithm for the unique enumeration of degrees of freedom suitable for continuous finite element spaces, describe considerations for weighted load balancing, and discuss the transfer of variable size data between processes. We illustrate the performance of our algorithms with numerical examples and demonstrate that they scale reasonably up to at least 16,384 message passage interface processes. We provide a reference implementation of our algorithms as part of the open source library deal.II.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"49 1","pages":"1 - 26"},"PeriodicalIF":2.7,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44759580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Reynolds, D. J. Gardner, C. Woodward, Rujeko Chinomona
We describe the ARKODE library of one-step time integration methods for ordinary differential equation (ODE) initial-value problems (IVPs). In addition to providing standard explicit and diagonally implicit Runge–Kutta methods, ARKODE supports one-step methods designed to treat additive splittings of the IVP, including implicit-explicit (ImEx) additive Runge–Kutta methods and multirate infinitesimal (MRI) methods. We present the role of ARKODE within the SUNDIALS suite of time integration and nonlinear solver libraries, the core ARKODE infrastructure for utilities common to large classes of one-step methods, as well as its use of “time stepper” modules enabling easy incorporation of novel algorithms into the library. Numerical results show example problems of increasing complexity, highlighting the algorithmic flexibility afforded through this infrastructure, and include a larger multiphysics application leveraging multiple algorithmic features from ARKODE and SUNDIALS.
{"title":"ARKODE: A Flexible IVP Solver Infrastructure for One-step Methods","authors":"D. Reynolds, D. J. Gardner, C. Woodward, Rujeko Chinomona","doi":"10.1145/3594632","DOIUrl":"https://doi.org/10.1145/3594632","url":null,"abstract":"We describe the ARKODE library of one-step time integration methods for ordinary differential equation (ODE) initial-value problems (IVPs). In addition to providing standard explicit and diagonally implicit Runge–Kutta methods, ARKODE supports one-step methods designed to treat additive splittings of the IVP, including implicit-explicit (ImEx) additive Runge–Kutta methods and multirate infinitesimal (MRI) methods. We present the role of ARKODE within the SUNDIALS suite of time integration and nonlinear solver libraries, the core ARKODE infrastructure for utilities common to large classes of one-step methods, as well as its use of “time stepper” modules enabling easy incorporation of novel algorithms into the library. Numerical results show example problems of increasing complexity, highlighting the algorithmic flexibility afforded through this infrastructure, and include a larger multiphysics application leveraging multiple algorithmic features from ARKODE and SUNDIALS.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":" ","pages":"1 - 26"},"PeriodicalIF":2.7,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47709068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
{"title":"Algorithm XXX: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions","authors":"C. Psarras, L. Karlsson, R. Bro, P. Bientinesi","doi":"10.1145/3519383","DOIUrl":"https://doi.org/10.1145/3519383","url":null,"abstract":"Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2022-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41347702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}