Algorithm 1012. Tyler H. Chang, L. Watson, T. Lux, A. Butt, K. Cameron, Yili Hong. ACM Transactions on Mathematical Software, 2020-11-07. https://doi.org/10.1145/3422818

DELAUNAYSPARSE contains both serial and parallel codes written in Fortran 2003 (with OpenMP) for performing medium- to high-dimensional interpolation via the Delaunay triangulation. To accommodate the exponential growth in the size of the Delaunay triangulation in high dimensions, DELAUNAYSPARSE computes only a sparse subset of the complete Delaunay triangulation, as necessary for performing interpolation at the user-specified points. This article includes algorithm and implementation details, complexity and sensitivity analyses, usage information, and a brief performance study.
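
To make the interpolation step concrete, here is a minimal Python sketch of what "interpolation via the Delaunay triangulation" means. It uses SciPy's full triangulation in 2-D purely for illustration; DELAUNAYSPARSE itself is Fortran and computes only the simplices containing the query points, which is what makes high dimensions tractable. The data sites, values, and query point below are made up.

```python
# Illustrative only: SciPy builds the *full* triangulation, which is exactly
# what DELAUNAYSPARSE avoids in high dimensions; the interpolation step at a
# query point q is the same either way.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
pts = rng.random((50, 2))              # data sites in R^2 (made-up example)
vals = np.sin(pts[:, 0]) + pts[:, 1]   # response values at the sites

tri = Delaunay(pts)
q = np.array([0.4, 0.6])               # query point, assumed inside the hull

s = tri.find_simplex(q)                # index of the simplex containing q
verts = tri.simplices[s]               # its d+1 vertex indices

# Barycentric weights of q in that simplex (nonnegative, summing to 1).
T = tri.transform[s]
b = T[:2] @ (q - T[2])
w = np.append(b, 1.0 - b.sum())

print(w @ vals[verts])                 # the interpolated value at q
```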
{"title":"Algorithm 1012","authors":"Tyler H. Chang, L. Watson, T. Lux, A. Butt, K. Cameron, Yili Hong","doi":"10.1145/3422818","DOIUrl":"https://doi.org/10.1145/3422818","url":null,"abstract":"DELAUNAYSPARSE contains both serial and parallel codes written in Fortran 2003 (with OpenMP) for performing medium- to high-dimensional interpolation via the Delaunay triangulation. To accommodate the exponential growth in the size of the Delaunay triangulation in high dimensions, DELAUNAYSPARSE computes only a sparse subset of the complete Delaunay triangulation, as necessary for performing interpolation at the user specified points. This article includes algorithm and implementation details, complexity and sensitivity analyses, usage information, and a brief performance study.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"PC-30 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2020-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84863862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Replicated Computational Results (RCR) Report for “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software”. S. Osborn. ACM Transactions on Mathematical Software, 2020-10-27. https://doi.org/10.1145/3446000

The article by Flegar et al. titled “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software” presents a novel, practical implementation of an adaptive precision block-Jacobi preconditioner. Performance results on state-of-the-art GPU architectures for generating and applying the block-Jacobi preconditioner demonstrate the practical usability of the method compared to a traditional full-precision block-Jacobi preconditioner. A production-ready implementation is provided in the Ginkgo numerical linear algebra library. In this report, the Ginkgo library is installed and performance results are generated and compared to the original results, using Ginkgo’s Conjugate Gradient solver with either the full-precision or the adaptive precision block-Jacobi preconditioner on a suite of test problems on an NVIDIA GPU accelerator. After completing this process, the published results are deemed reproducible.
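
As a rough illustration of the idea being reproduced (not Ginkgo's actual API, which is C++), the sketch below stores each inverted diagonal block in the cheapest floating point format its conditioning tolerates and applies the preconditioner block by block in working precision. The thresholds in choose_dtype are made-up placeholders, not the paper's selection criterion.

```python
import numpy as np

def choose_dtype(kappa):
    # Hypothetical selection rule: the better conditioned the block, the
    # cheaper the storage format. The paper's criterion is chosen so the
    # overall preconditioner quality is preserved.
    if kappa < 1.0e2:
        return np.float16
    if kappa < 1.0e6:
        return np.float32
    return np.float64

def build_block_jacobi(A, bs):
    # Invert each diagonal block, then down-cast it for storage.
    blocks = []
    for i in range(0, A.shape[0], bs):
        B = A[i:i + bs, i:i + bs]
        blocks.append(np.linalg.inv(B).astype(choose_dtype(np.linalg.cond(B))))
    return blocks

def apply_block_jacobi(blocks, r, bs):
    # z = M^{-1} r; storage precision varies, arithmetic stays in float64.
    z = np.empty_like(r)
    for k, Binv in enumerate(blocks):
        i = k * bs
        z[i:i + Binv.shape[0]] = Binv.astype(np.float64) @ r[i:i + Binv.shape[0]]
    return z
```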
{"title":"Replicated Computational Results (RCR) Report for “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software”","authors":"S. Osborn","doi":"10.1145/3446000","DOIUrl":"https://doi.org/10.1145/3446000","url":null,"abstract":"The article by Flegar et al. titled “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software” presents a novel, practical implementation of an adaptive precision block-Jacobi preconditioner. Performance results using state-of-the-art GPU architectures for the block-Jacobi preconditioner generation and application demonstrate the practical usability of the method, compared to a traditional full-precision block-Jacobi preconditioner. A production-ready implementation is provided in the Ginkgo numerical linear algebra library. In this report, the Ginkgo library is reinstalled and performance results are generated to perform a comparison to the original results when using Ginkgo’s Conjugate Gradient solver with either the full or the adaptive precision block-Jacobi preconditioner for a suite of test problems on an NVIDIA GPU accelerator. After completing this process, the published results are deemed reproducible.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"16 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2020-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74237122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Formalization of Double-Word Arithmetic, and Comments on “Tight and Rigorous Error Bounds for Basic Building Blocks of Double-Word Arithmetic”. J. Muller, L. Rideau. ACM Transactions on Mathematical Software, 2020-10-20. https://doi.org/10.1145/3484514

Recently, a complete set of algorithms for manipulating double-word numbers (some classical, some new) was analyzed [16]. We have formally proven all the theorems given in that article using the Coq proof assistant. The formal proof work led us to: (i) locate mistakes in some of the proofs of the original paper (mistakes that, however, do not hinder the validity of the algorithms); (ii) significantly improve some error bounds; and (iii) generalize some results by showing that they remain valid if we slightly change the rounding mode. The consequence is that the algorithms presented in [16] can be used with high confidence, and that some of them are even more accurate than previously believed. This illustrates what formal proof can bring to computer arithmetic: beyond mere (yet extremely useful) verification, correction, and consolidation of already known results, it can help to find new properties. All our formal proofs are freely available.
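
For readers unfamiliar with the objects being formalized: a double-word number represents a value as an unevaluated sum of two floating point numbers. The sketch below shows, in Python, the classical error-free transformations and one of the algorithms analyzed in [16] (adding a double to a double-word, often called DWPlusFP). This is an informal illustration, not the Coq development itself.

```python
def two_sum(a, b):
    # Knuth's 2Sum: returns (s, e) with s = fl(a + b) and s + e = a + b exactly.
    s = a + b
    ap = s - b
    bp = s - ap
    return s, (a - ap) + (b - bp)

def fast_two_sum(a, b):
    # Dekker's Fast2Sum: same contract, but requires |a| >= |b|.
    s = a + b
    return s, b - (s - a)

def dw_plus_fp(xh, xl, y):
    # Add the double y to the double-word (xh, xl); result is a double-word.
    sh, sl = two_sum(xh, y)
    return fast_two_sum(sh, xl + sl)

# (1 + 2^-60) + 2^-70 keeps low-order bits an ordinary double would discard.
print(dw_plus_fp(1.0, 2.0**-60, 2.0**-70))   # -> (1.0, ~8.67e-19)
```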
{"title":"Formalization of Double-Word Arithmetic, and Comments on “Tight and Rigorous Error Bounds for Basic Building Blocks of Double-Word Arithmetic”","authors":"J. Muller, L. Rideau","doi":"10.1145/3484514","DOIUrl":"https://doi.org/10.1145/3484514","url":null,"abstract":"Recently, a complete set of algorithms for manipulating double-word numbers (some classical, some new) was analyzed [16]. We have formally proven all the theorems given in that article, using the Coq proof assistant. The formal proof work led us to: (i) locate mistakes in some of the original paper proofs (mistakes that, however, do not hinder the validity of the algorithms), (ii) significantly improve some error bounds, and (iii) generalize some results by showing that they are still valid if we slightly change the rounding mode. The consequence is that the algorithms presented in [16] can be used with high confidence, and that some of them are even more accurate than what was believed before. This illustrates what formal proof can bring to computer arithmetic: beyond mere (yet extremely useful) verification, correction, and consolidation of already known results, it can help to find new properties. All our formal proofs are freely available.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"258 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2020-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86714386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Feature-complete SPIKE Dense Banded Solver. Braegan S. Spring, E. Polizzi, A. Sameh. ACM Transactions on Mathematical Software, 2020-10-16. https://doi.org/10.1145/3410153

This article presents a parallel, effective, and feature-complete recursive SPIKE algorithm that achieves near feature-parity with the banded linear system solver of the standard linear algebra package (LAPACK). First, we present a flexible parallel implementation of the recursive SPIKE scheme that removes its original limitation that the number of cores/processors be a power of two. A new transpose-solve option for SPIKE is then developed to satisfy a standard requirement of most numerical solver libraries. Finally, a pivoting recursive SPIKE strategy is presented as an alternative to the non-pivoting scheme to improve numerical stability. All these enhancements lead to the release of a new black-box, feature-complete SPIKE-OpenMP package that significantly improves upon the performance and scalability obtained with other state-of-the-art banded solvers.
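
For readers unfamiliar with the SPIKE idea, here is a minimal dense two-partition sketch in Python (NumPy). The real solver works on banded storage, recurses over many partitions, and runs in parallel; this sketch only shows the underlying algebra: factor A = D S, where D holds the diagonal blocks and the "spikes" of S couple the partitions through a small reduced system on the interface unknowns.

```python
import numpy as np

def spike_solve_2(A, f, m, k):
    """Solve A x = f, with A of size 2m x 2m, banded with half-bandwidth k."""
    A1, A2 = A[:m, :m], A[m:, m:]
    B = A[m - k:m, m:m + k]          # coupling block above the partition line
    C = A[m:m + k, m - k:m]          # coupling block below the partition line

    # Spikes V = A1^{-1} [0; B] and W = A2^{-1} [C; 0], plus local solves.
    rhsV = np.zeros((m, k)); rhsV[m - k:, :] = B
    rhsW = np.zeros((m, k)); rhsW[:k, :] = C
    V, W = np.linalg.solve(A1, rhsV), np.linalg.solve(A2, rhsW)
    g1, g2 = np.linalg.solve(A1, f[:m]), np.linalg.solve(A2, f[m:])

    # Small 2k x 2k reduced system on the interface unknowns.
    R = np.block([[np.eye(k), V[m - k:, :]],
                  [W[:k, :],  np.eye(k)]])
    t = np.linalg.solve(R, np.concatenate([g1[m - k:], g2[:k]]))
    x1b, x2t = t[:k], t[k:]

    # Back-substitute the interface values into the local solutions.
    return np.concatenate([g1 - V @ x2t, g2 - W @ x1b])
```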
{"title":"A Feature-complete SPIKE Dense Banded Solver","authors":"Braegan S. Spring, E. Polizzi, A. Sameh","doi":"10.1145/3410153","DOIUrl":"https://doi.org/10.1145/3410153","url":null,"abstract":"This article presents a parallel, effective, and feature-complete recursive SPIKE algorithm that achieves near feature-parity with the standard linear algebra package banded linear system solver. First, we present a flexible parallel implementation of the recursive SPIKE scheme that aims at removing its original limitation that the number of cores/processors be restricted to powers of two. A new transpose solve option for SPIKE is then developed to satisfy a standard requirement of most numerical solver libraries. Finally, a pivoting recursive SPIKE strategy is presented as an alternative to the non-pivoting scheme to improve numerical stability. All these new enhancements lead to the release of a new black-box feature-complete SPIKE-OpenMP package that significantly improves upon the performance and scalability obtained with other state-of-the-art banded solvers.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"57 1","pages":"1 - 35"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77643325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Variable Step-Size Control Based on Two-Steps for Radau IIA Methods. S. G. Pinto, D. H. Abreu, J. I. Montijano. ACM Transactions on Mathematical Software, 2020-10-16. https://doi.org/10.1145/3408892

Two-step embedded methods of order s based on s-stage Radau IIA formulas are considered for the variable step-size integration of stiff differential equations. These embedded methods are aimed at local error control and are computed through a linear combination of the internal stages of the underlying method in the last two steps. Particular embedded methods with 2 ≤ s ≤ 7 internal stages, with good stability properties and damping for the stiff components, are constructed. Furthermore, a new formula for step-size change is proposed, which has the advantage that it can be applied to any s-stage Radau IIA method. Numerical testing on some representative stiff problems shows that the RADAU5 code by Hairer and Wanner performs as well with the new strategy as with the standard one, or even better; the standard strategy is only feasible for an odd number of stages. Numerical experiments support the efficiency and flexibility of the new step-size change strategy.
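
As context for what "step-size change" means here: codes like RADAU5 accept a step when a weighted local error estimate err is at most 1, and rescale h by a power of err. The sketch below is the textbook controller, not the article's new two-step formula; the article's contribution lies in how err is built from the internal stages of the last two steps for any s.

```python
def new_step_size(h, err, est_order, safety=0.9, fmin=0.2, fmax=8.0):
    # err: weighted norm of the local error estimate (step accepted if <= 1);
    # est_order: order of the embedded error estimator.
    fac = safety * max(err, 1e-16) ** (-1.0 / (est_order + 1))
    return h * min(fmax, max(fmin, fac))

# Example: error twice the tolerance with a 4th-order estimator -> shrink h.
print(new_step_size(0.1, 2.0, 4))   # ~0.078
```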
{"title":"Variable Step-Size Control Based on Two-Steps for Radau IIA Methods","authors":"S. G. Pinto, D. H. Abreu, J. I. Montijano","doi":"10.1145/3408892","DOIUrl":"https://doi.org/10.1145/3408892","url":null,"abstract":"Two-step embedded methods of order s based on s-stage Radau IIA formulas are considered for the variable step-size integration of stiff differential equations. These embedded methods are aimed at local error control and are computed through a linear combination of the internal stages of the underlying method in the last two steps. Particular embedded methods for 2 ≤ s ≤ 7 internal stages with good stability properties and damping for the stiff components are constructed. Furthermore, a new formula for step-size change is proposed, having the advantage that it can be applied to any s-stage Radau IIA method. It is shown through numerical testing on some representative stiff problems that the RADAU5 code by Hairer and Wanner with the new strategy performs as well or even better as with the standard one, which is only feasible for an odd number of stages. Numerical experiments support the efficiency and flexibility of the new step-size change strategy.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"12 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81040413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

PHIST. J. Thies, Melven Röhrig-Zöllner, N. Overmars, A. Basermann, Dominik Ernst, G. Hager, G. Wellein. ACM Transactions on Mathematical Software, 2020-10-16. https://doi.org/10.1145/3402227

The increasing complexity of hardware and software environments in high-performance computing poses major challenges for the development of sustainable and hardware-efficient numerical software. This article addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them. Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity platform for performance-aware developers of algorithms and application software, with abstractions that do not obscure the view of hardware-software interaction. The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages. An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and PHIST supports common techniques for improving the scalability and performance of algorithms, such as blocking and kernel fusion. We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized eigenproblems. We study its performance on a multi-core CPU, a GPU, and a large-scale many-core system. Furthermore, we show how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST.
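
To illustrate one of the techniques mentioned above, the sketch below contrasts an unfused and a (conceptually) fused kernel in Python pseudocode. This is not PHIST's interface; it just shows why fusion pays off: sparse solvers are memory bound, and fusing operations that traverse the same vectors reduces memory traffic.

```python
import numpy as np

def axpy_then_norm2(alpha, x, y):
    # Unfused: two full sweeps over y's memory.
    y += alpha * x
    return np.dot(y, y)

def fused_axpy_norm2(alpha, x, y):
    # Fused: one sweep computes both the update and the dot product.
    # (Written as a scalar loop for clarity; in a library like PHIST such
    # loops live behind the kernel interface as single optimized kernels.)
    s = 0.0
    for i in range(y.shape[0]):
        y[i] += alpha * x[i]
        s += y[i] * y[i]
    return s
```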
{"title":"PHIST","authors":"J. Thies, Melven Röhrig-Zöllner, N. Overmars, A. Basermann, Dominik Ernst, G. Hager, G. Wellein","doi":"10.1145/3402227","DOIUrl":"https://doi.org/10.1145/3402227","url":null,"abstract":"The increasing complexity of hardware and software environments in high-performance computing poses big challenges on the development of sustainable and hardware-efficient numerical software. This article addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them. Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity platform for performance-aware developers of algorithms and application software with abstractions that do not obscure the view on hardware-software interaction. The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages. An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and PHIST supports common techniques for improving scalability and performance of algorithms such as blocking and kernel fusion. We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized eigenproblems. We study its performance on a multi-core CPU, a GPU, and a large-scale many-core system. Furthermore, we show how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"283 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73398264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Algorithm 1011. Thomas Mejstrik. ACM Transactions on Mathematical Software, 2020-09-15. https://doi.org/10.1145/3408891

In several papers from 2013–2016, Guglielmi and Protasov made a breakthrough on the problem of computing the joint spectral radius, developing the invariant polytope algorithm, which finds the exact value of the joint spectral radius for most matrix families. This algorithm has found many applications in problems of functional analysis, approximation theory, combinatorics, and so on. In this article, we propose a modification of the invariant polytope algorithm that makes it roughly three times faster (single threaded) and suitable for higher dimensions, and we parallelise it. The modified version works for most matrix families of dimension up to 25, and for non-negative matrices of dimension up to 3,000. In addition, we introduce a new, fast algorithm, called the modified Gripenberg algorithm, for computing good lower bounds on the joint spectral radius. Corresponding examples and statistics of numerical results are provided, and several applications of our algorithms are presented. In particular, we find the exact values of the regularity exponents of Daubechies wavelets up to order 42 and the capacities of codes that avoid certain difference patterns.
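
To make the lower-bound side concrete: for any product P of k matrices from the family, rho(P)^(1/k) is a valid lower bound on the joint spectral radius. The toy search below (Python) enumerates short products exhaustively; Gripenberg-style algorithms, including the modified one proposed here, obtain good bounds far more cheaply by pruning this exponential search.

```python
import itertools
import numpy as np

def jsr_lower_bound(mats, max_len=6):
    # Exhaustive toy search over all products of length <= max_len.
    best = 0.0
    for k in range(1, max_len + 1):
        for combo in itertools.product(mats, repeat=k):
            P = combo[0] if k == 1 else np.linalg.multi_dot(combo)
            best = max(best, np.max(np.abs(np.linalg.eigvals(P))) ** (1.0 / k))
    return best

# Example with a classical pair of binary matrices.
A0 = np.array([[1.0, 1.0], [0.0, 1.0]])
A1 = np.array([[1.0, 0.0], [1.0, 1.0]])
print(jsr_lower_bound([A0, A1]))   # -> the golden ratio, 1.618...
```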
{"title":"Algorithm 1011","authors":"Thomas Mejstrik","doi":"10.1145/3408891","DOIUrl":"https://doi.org/10.1145/3408891","url":null,"abstract":"In several papers of 2013–2016, Guglielmi and Protasov made a breakthrough in the problem of the joint spectral radius computation, developing the invariant polytope algorithm that for most matrix families finds the exact value of the joint spectral radius. This algorithm found many applications in problems of functional analysis, approximation theory, combinatorics, and so on. In this article, we propose a modification of the invariant polytope algorithm making it roughly 3 times faster (single threaded), suitable for higher dimensions, and parallelise it. The modified version works for most matrix families of dimensions up to 25, for non-negative matrices up to 3,000. In addition, we introduce a new, fast algorithm, called modified Gripenberg algorithm, for computing good lower bounds for the joint spectral radius. The corresponding examples and statistics of numerical results are provided. Several applications of our algorithms are presented. In particular, we find the exact values of the regularity exponents of Daubechies wavelets up to order 42 and the capacities of codes that avoid certain difference patterns.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"20 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2020-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83323323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x. Timothée Ewart, Francesco Cremonesi, F. Schürmann, F. Delalondre. ACM Transactions on Mathematical Software, 2020-09-15. https://doi.org/10.1145/3408893

The evaluation of small-degree polynomials is critical for the computation of elementary functions. It has been extensively studied and is well documented. In this article, we evaluate existing methods for polynomial evaluation on superscalar architectures. In addition, we complement this work with a factorization method, which is surprisingly neglected in the literature. This work focuses on out-of-order processors, Intel processors amongst others, for which detailed information about the computational units is available. Moreover, we apply our work to the elementary function e^x, which in the current implementation requires evaluating a polynomial of degree 10 for satisfactory precision and performance. Our results show that the factorization scheme is the fastest in benchmarks, and that latency and throughput are intrinsically dependent on each other on superscalar architectures.
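
To illustrate what is being compared: the schemes differ in how much instruction-level parallelism they expose to a superscalar core. A Horner evaluation is one long chain of dependent multiply-adds, while splitting schemes (and the factorization scheme studied here, which writes p as a product of low-degree factors) create independent chains that can issue concurrently. A small Python sketch of the two extremes:

```python
def horner(c, x):
    # c[i] is the coefficient of x^i. One dependent multiply-add per step:
    # the latency chain is as long as the degree.
    acc = c[-1]
    for ci in reversed(c[:-1]):
        acc = acc * x + ci
    return acc

def estrin_deg3(c, x):
    # p(x) = (c0 + c1*x) + (c2 + c3*x)*x^2: the two inner sums and x*x are
    # independent, so a superscalar core can compute them in parallel.
    x2 = x * x
    return (c[0] + c[1] * x) + (c[2] + c[3] * x) * x2

coeffs = [1.0, 1.0, 0.5, 1.0 / 6.0]        # degree-3 Taylor series of e^x
print(horner(coeffs, 0.1), estrin_deg3(coeffs, 0.1))   # both ~1.10517
```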
{"title":"Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function ex","authors":"Timothée Ewart, Francesco Cremonesi, F. Schürmann, F. Delalondre","doi":"10.1145/3408893","DOIUrl":"https://doi.org/10.1145/3408893","url":null,"abstract":"The evaluation of small degree polynomials is critical for the computation of elementary functions. It has been extensively studied and is well documented. In this article, we evaluate existing methods for polynomial evaluation on superscalar architecture. In addition, we have completed this work with a factorization method, which is surprisingly neglected in the literature. This work focuses on out-of-order Intel processors, amongst others, of which computational units are available. Moreover, we applied our work on the elementary function ex that requires, in the current implementation, an evaluation of a polynomial of degree 10 for a satisfying precision and performance. Our results show that the factorization scheme is the fastest in benchmarks, and that latency and throughput are intrinsically dependent on each other on superscalar architecture.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"1 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2020-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84015150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

BiqBin: A Parallel Branch-and-bound Solver for Binary Quadratic Problems with Linear Constraints. Nicoló Gusmeroli, T. Hrga, Borut Lužar, J. Povh, Melanie Siebenhofer, Angelika Wiegele. ACM Transactions on Mathematical Software, 2020-09-14. https://doi.org/10.1145/3514039

We present BiqBin, an exact solver for linearly constrained binary quadratic problems. Our approach is based on an exact penalty method that first efficiently transforms the original problem into an instance of Max-Cut; the Max-Cut problem is then solved by a branch-and-bound algorithm. All the main ingredients are carefully developed using new semidefinite programming relaxations, obtained by strengthening the existing relaxations with a set of hypermetric inequalities, applying the bundle method as the bounding routine, and using new strategies for exploring the branch-and-bound tree. Furthermore, an efficient C implementation of a sequential and a parallel branch-and-bound algorithm is presented. The latter is based on a load coordinator-worker scheme using MPI for multi-node parallelization and is evaluated on a high-performance computer. The new solver is benchmarked against BiqCrunch, GUROBI, and SCIP on four families of (linearly constrained) binary quadratic problems. Numerical results demonstrate that BiqBin is a highly competitive solver: the serial version outperforms the other three solvers on the majority of the benchmark instances. We also evaluate the parallel solver and show that it has good scaling properties. It is available to the general audience as an online service at http://www.biqbin.eu.
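
The first transformation step can be sketched compactly. Under an exact penalty approach, a constrained problem min x^T Q x + c^T x subject to Ax = b, x in {0,1}^n becomes an unconstrained binary quadratic problem by adding sigma * ||Ax - b||^2 to the objective, with sigma large enough that the penalty is exact; the result is then mapped to Max-Cut. A NumPy sketch of the penalty fold (the choice of sigma and the Max-Cut mapping, which are the delicate parts, are omitted):

```python
import numpy as np

def penalized_bqp(Q, c, A, b, sigma):
    # ||Ax - b||^2 = x^T (A^T A) x - 2 (A^T b)^T x + b^T b, so the penalty
    # only modifies the quadratic and linear terms plus a constant shift.
    Qp = Q + sigma * (A.T @ A)
    cp = c - 2.0 * sigma * (A.T @ b)
    const = sigma * float(b @ b)
    return Qp, cp, const   # minimize x^T Qp x + cp^T x + const over {0,1}^n
```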
{"title":"BiqBin: A Parallel Branch-and-bound Solver for Binary Quadratic Problems with Linear Constraints","authors":"Nicoló Gusmeroli, T. Hrga, Borut Lužar, J. Povh, Melanie Siebenhofer, Angelika Wiegele","doi":"10.1145/3514039","DOIUrl":"https://doi.org/10.1145/3514039","url":null,"abstract":"We present BiqBin, an exact solver for linearly constrained binary quadratic problems. Our approach is based on an exact penalty method to first efficiently transform the original problem into an instance of Max-Cut, and then to solve the Max-Cut problem by a branch-and-bound algorithm. All the main ingredients are carefully developed using new semidefinite programming relaxations obtained by strengthening the existing relaxations with a set of hypermetric inequalities, applying the bundle method as the bounding routine and using new strategies for exploring the branch-and-bound tree. Furthermore, an efficient C implementation of a sequential and a parallel branch-and-bound algorithm is presented. The latter is based on a load coordinator-worker scheme using MPI for multi-node parallelization and is evaluated on a high-performance computer. The new solver is benchmarked against BiqCrunch, GUROBI, and SCIP on four families of (linearly constrained) binary quadratic problems. Numerical results demonstrate that BiqBin is a highly competitive solver. The serial version outperforms the other three solvers on the majority of the benchmark instances. We also evaluate the parallel solver and show that it has good scaling properties. The general audience can use it as an on-line service available at http://www.biqbin.eu.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"1 1","pages":"1 - 31"},"PeriodicalIF":0.0,"publicationDate":"2020-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88629732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Algorithms for Efficient Reproducible Floating Point Summation. Peter Ahrens, J. Demmel, Hong Diep Nguyen. ACM Transactions on Mathematical Software, 2020-07-21. https://doi.org/10.1145/3389360

We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with the nonassociativity of floating point addition, makes reproducibility challenging even for summation, or for operations like the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE 754-2008 floating point standard and bitwise operations on the standard representations in memory. Our approach requires only one read-only pass over the data and one reduction in parallel, using a 6-word reproducible accumulator (more words can be used for higher accuracy), enabling standard tiling optimization techniques. Summing n words with a 6-word reproducible accumulator requires approximately 9n floating point operations (arithmetic, comparison, and absolute value) and approximately 3n bitwise operations. The final error bound with a 6-word reproducible accumulator and our default settings can be up to 2^29 times smaller than the error bound for conventional (recursive) summation on ill-conditioned double-precision inputs.
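
A toy version of the key trick conveys why the result becomes order-independent. If every summand is first rounded onto one common grid (here via a Fast2Sum-style extraction against a large fixed anchor), the subsequent additions are exact, so any summation order gives the same bits; the binned number does this against several anchors at once (6 words by default) to keep accuracy over a wide dynamic range. The sketch below is a single-bin simplification assuming round-to-nearest, nonzero inputs, and no overflow, not the paper's actual data structure:

```python
import math

def round_to_grid(x, anchor):
    # Rounds x to a multiple of ulp(anchor); exact extraction provided
    # |x| is far below anchor.
    return (anchor + x) - anchor

def toy_reproducible_sum(xs):
    # One anchor well above every summand: all rounded values share a grid,
    # so their sum is exact and independent of ordering (as long as the
    # number of summands stays far below 2^53).
    top = max(abs(x) for x in xs)
    anchor = 1.5 * 2.0 ** (math.ceil(math.log2(top)) + 30)
    return sum(round_to_grid(x, anchor) for x in xs)

vals = [0.1, 0.2, 0.3] * 1000
assert toy_reproducible_sum(vals) == toy_reproducible_sum(list(reversed(vals)))
```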
{"title":"Algorithms for Efficient Reproducible Floating Point Summation","authors":"Peter Ahrens, J. Demmel, Hong Diep Nguyen","doi":"10.1145/3389360","DOIUrl":"https://doi.org/10.1145/3389360","url":null,"abstract":"We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociative floating point addition, makes reproducibility challenging even for summation, or operations like the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE Floating Point Standard 754-2008 and bitwise operations on the standard representations in memory. Our approach requires only one read-only pass over the data, and one reduction in parallel, using a 6-word reproducible accumulator (more words can be used for higher accuracy), enabling standard tiling optimization techniques. Summing n words with a 6-word reproducible accumulator requires approximately 9n floating point operations (arithmetic, comparison, and absolute value) and approximately 3n bitwise operations. The final error bound with a 6-word reproducible accumulator and our default settings can be up to 229 times smaller than the error bound for conventional (recursive) summation on ill-conditioned double-precision inputs.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"56 1","pages":"1 - 49"},"PeriodicalIF":0.0,"publicationDate":"2020-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74553113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}