ACM Transactions on Mathematical Software (TOMS): Latest Articles

Algorithm 1012
Pub Date : 2020-11-07 DOI: 10.1145/3422818
Tyler H. Chang, L. Watson, T. Lux, A. Butt, K. Cameron, Yili Hong
DELAUNAYSPARSE contains both serial and parallel codes written in Fortran 2003 (with OpenMP) for performing medium- to high-dimensional interpolation via the Delaunay triangulation. To accommodate the exponential growth in the size of the Delaunay triangulation in high dimensions, DELAUNAYSPARSE computes only a sparse subset of the complete Delaunay triangulation, as necessary for performing interpolation at the user specified points. This article includes algorithm and implementation details, complexity and sensitivity analyses, usage information, and a brief performance study.
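Interpolation via a Delaunay triangulation is piecewise linear: once the simplex containing a query point is known, the interpolated value is the barycentric-weighted average of the values at that simplex's vertices. The following is a minimal Python sketch of that last step only, in two dimensions. It assumes the containing triangle has already been found, and the function names are hypothetical; it is not the DELAUNAYSPARSE Fortran interface.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def interpolate_in_triangle(p, vertices, values):
    """Linear interpolant at p from the values at the three triangle vertices."""
    weights = barycentric_weights(p, *vertices)
    # A negative weight means p lies outside this triangle (small tolerance
    # allows for rounding on the boundary).
    if min(weights) < -1e-12:
        raise ValueError("query point is outside the given triangle")
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical data: samples of f(x, y) = 1 + 2x + 3y at the triangle vertices.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
samples = [1.0, 3.0, 4.0]
print(interpolate_in_triangle((0.25, 0.25), triangle, samples))
```

Because the interpolant only needs the one simplex that contains each query point, computing the full triangulation is unnecessary, which is the observation the abstract builds on.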
Citations: 5
Replicated Computational Results (RCR) Report for “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software”
Pub Date : 2020-10-27 DOI: 10.1145/3446000
S. Osborn
The article by Flegar et al. titled “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software” presents a novel, practical implementation of an adaptive precision block-Jacobi preconditioner. Performance results using state-of-the-art GPU architectures for the block-Jacobi preconditioner generation and application demonstrate the practical usability of the method, compared to a traditional full-precision block-Jacobi preconditioner. A production-ready implementation is provided in the Ginkgo numerical linear algebra library. In this report, the Ginkgo library is reinstalled and performance results are generated to perform a comparison to the original results when using Ginkgo’s Conjugate Gradient solver with either the full or the adaptive precision block-Jacobi preconditioner for a suite of test problems on an NVIDIA GPU accelerator. After completing this process, the published results are deemed reproducible.
Citations: 0
Formalization of Double-Word Arithmetic, and Comments on “Tight and Rigorous Error Bounds for Basic Building Blocks of Double-Word Arithmetic”
Pub Date : 2020-10-20 DOI: 10.1145/3484514
J. Muller, L. Rideau
Recently, a complete set of algorithms for manipulating double-word numbers (some classical, some new) was analyzed [16]. We have formally proven all the theorems given in that article, using the Coq proof assistant. The formal proof work led us to: (i) locate mistakes in some of the original paper proofs (mistakes that, however, do not hinder the validity of the algorithms), (ii) significantly improve some error bounds, and (iii) generalize some results by showing that they are still valid if we slightly change the rounding mode. The consequence is that the algorithms presented in [16] can be used with high confidence, and that some of them are even more accurate than what was believed before. This illustrates what formal proof can bring to computer arithmetic: beyond mere (yet extremely useful) verification, correction, and consolidation of already known results, it can help to find new properties. All our formal proofs are freely available.
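A double-word number represents a value as the unevaluated sum of two floating-point numbers, a high word and a much smaller low word. The building blocks are error-free transformations such as TwoSum (Knuth) and Fast2Sum (Dekker), which return both the rounded sum and its exact rounding error. The Python sketch below shows these two transformations and one way to add a double to a double-word number with them. The function names are hypothetical, and the addition routine is an illustration in the spirit of the algorithms discussed, not a transcription of the formally verified ones.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b == s + e exactly (no overflow)."""
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    return s, (a - a_virtual) + (b - b_virtual)


def fast_two_sum(a, b):
    """Same result as two_sum, but assumes |a| >= |b| (or a == 0)."""
    s = a + b
    return s, b - (s - a)


def dw_plus_fp(xh, xl, y):
    """Add the double y to the double-word number (xh, xl); returns (zh, zl)."""
    sh, sl = two_sum(xh, y)
    v = xl + sl  # the only rounding that loses information
    return fast_two_sum(sh, v)


# The pair (s, e) carries the exact sum even though s alone does not.
s, e = two_sum(1.0, 1e-16)
assert Fraction(s) + Fraction(e) == Fraction(1.0) + Fraction(1e-16)
print(s, e)
print(dw_plus_fp(1.0, 1e-17, 1e-16))
```

The exactness check uses `fractions.Fraction`, which converts each double to the rational number it represents, so the comparison involves no rounding.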
Citations: 3
A Feature-complete SPIKE Dense Banded Solver
Pub Date : 2020-10-16 DOI: 10.1145/3410153
Braegan S. Spring, E. Polizzi, A. Sameh
This article presents a parallel, effective, and feature-complete recursive SPIKE algorithm that achieves near feature-parity with the standard linear algebra package banded linear system solver. First, we present a flexible parallel implementation of the recursive SPIKE scheme that aims at removing its original limitation that the number of cores/processors be restricted to powers of two. A new transpose solve option for SPIKE is then developed to satisfy a standard requirement of most numerical solver libraries. Finally, a pivoting recursive SPIKE strategy is presented as an alternative to the non-pivoting scheme to improve numerical stability. All these new enhancements lead to the release of a new black-box feature-complete SPIKE-OpenMP package that significantly improves upon the performance and scalability obtained with other state-of-the-art banded solvers.
Citations: 1
Variable Step-Size Control Based on Two-Steps for Radau IIA Methods
Pub Date : 2020-10-16 DOI: 10.1145/3408892
S. G. Pinto, D. H. Abreu, J. I. Montijano
Two-step embedded methods of order s based on s-stage Radau IIA formulas are considered for the variable step-size integration of stiff differential equations. These embedded methods are aimed at local error control and are computed through a linear combination of the internal stages of the underlying method in the last two steps. Particular embedded methods for 2 ≤ s ≤ 7 internal stages with good stability properties and damping for the stiff components are constructed. Furthermore, a new formula for step-size change is proposed, having the advantage that it can be applied to any s-stage Radau IIA method. It is shown through numerical testing on some representative stiff problems that the RADAU5 code by Hairer and Wanner performs as well with the new strategy as with the standard one, or even better; the standard strategy is only feasible for an odd number of stages. Numerical experiments support the efficiency and flexibility of the new step-size change strategy.
Citations: 1
PHIST
Pub Date : 2020-10-16 DOI: 10.1145/3402227
J. Thies, Melven Röhrig-Zöllner, N. Overmars, A. Basermann, Dominik Ernst, G. Hager, G. Wellein
The increasing complexity of hardware and software environments in high-performance computing poses big challenges on the development of sustainable and hardware-efficient numerical software. This article addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them. Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity platform for performance-aware developers of algorithms and application software with abstractions that do not obscure the view on hardware-software interaction. The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages. An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and PHIST supports common techniques for improving scalability and performance of algorithms such as blocking and kernel fusion. We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized eigenproblems. We study its performance on a multi-core CPU, a GPU, and a large-scale many-core system. Furthermore, we show how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST.
Citations: 1
Algorithm 1011
Pub Date : 2020-09-15 DOI: 10.1145/3408891
Thomas Mejstrik
In several papers of 2013–2016, Guglielmi and Protasov made a breakthrough in the problem of the joint spectral radius computation, developing the invariant polytope algorithm that for most matrix families finds the exact value of the joint spectral radius. This algorithm found many applications in problems of functional analysis, approximation theory, combinatorics, and so on. In this article, we propose a modification of the invariant polytope algorithm that makes it roughly 3 times faster (single threaded) and suitable for higher dimensions, and we parallelise it. The modified version works for most matrix families of dimensions up to 25, and for non-negative matrices of dimensions up to 3,000. In addition, we introduce a new, fast algorithm, called the modified Gripenberg algorithm, for computing good lower bounds for the joint spectral radius. The corresponding examples and statistics of numerical results are provided. Several applications of our algorithms are presented. In particular, we find the exact values of the regularity exponents of Daubechies wavelets up to order 42 and the capacities of codes that avoid certain difference patterns.
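For orientation, the joint spectral radius of a finite matrix family can be bracketed by enumerating all products of a fixed length k: the k-th root of the largest spectral radius among those products is a lower bound, and the k-th root of the largest (submultiplicative) norm is an upper bound. The Python sketch below does exactly this brute-force enumeration for 2×2 matrices. It is neither the invariant polytope algorithm nor the modified Gripenberg algorithm, the function names are hypothetical, and the example family is an assumption chosen for illustration.

```python
import cmath
import itertools


def matmul2(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [p[0][0] * q[0][0] + p[0][1] * q[1][0], p[0][0] * q[0][1] + p[0][1] * q[1][1]],
        [p[1][0] * q[0][0] + p[1][1] * q[1][0], p[1][0] * q[0][1] + p[1][1] * q[1][1]],
    ]


def spectral_radius2(m):
    """Largest eigenvalue modulus of a 2x2 matrix, from its trace and determinant."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))


def inf_norm(m):
    """Maximum absolute row sum, a submultiplicative matrix norm."""
    return max(sum(abs(v) for v in row) for row in m)


def jsr_bounds(matrices, k):
    """Lower and upper bounds on the joint spectral radius from length-k products."""
    lower, upper = 0.0, 0.0
    for word in itertools.product(matrices, repeat=k):
        prod = word[0]
        for m in word[1:]:
            prod = matmul2(prod, m)
        lower = max(lower, spectral_radius2(prod) ** (1.0 / k))
        upper = max(upper, inf_norm(prod) ** (1.0 / k))
    return lower, upper


family = [[[1, 1], [0, 1]], [[1, 0], [1, 1]]]
for k in (1, 2, 4, 8):
    print(k, jsr_bounds(family, k))
```

The number of products grows as m^k for m matrices, which is why exhaustive enumeration stops being practical quickly and more refined algorithms are needed.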
Citations: 13
Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x
Pub Date : 2020-09-15 DOI: 10.1145/3408893
Timothée Ewart, Francesco Cremonesi, F. Schürmann, F. Delalondre
The evaluation of small degree polynomials is critical for the computation of elementary functions. It has been extensively studied and is well documented. In this article, we evaluate existing methods for polynomial evaluation on superscalar architecture. In addition, we have completed this work with a factorization method, which is surprisingly neglected in the literature. This work focuses on out-of-order Intel processors, amongst others, of which computational units are available. Moreover, we applied our work on the elementary function e^x that requires, in the current implementation, an evaluation of a polynomial of degree 10 for a satisfying precision and performance. Our results show that the factorization scheme is the fastest in benchmarks, and that latency and throughput are intrinsically dependent on each other on superscalar architecture.
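Two of the standard evaluation schemes can be contrasted in a few lines. Horner's rule uses the fewest operations but forms one long dependency chain, while Estrin's scheme pairs coefficients so that several multiply-adds are independent and can be issued in parallel by a superscalar core. The Python sketch below shows both on a degree-10 polynomial for e^x. The Taylor coefficients are an assumption made for simplicity (a production implementation would more likely use minimax coefficients after range reduction), and the factorization scheme from the article is not shown.

```python
import math


def horner(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) with one sequential chain of multiply-adds."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by pairing terms level by level.

    Within each level, the pair computations do not depend on one another,
    which is what exposes instruction-level parallelism in compiled code.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so that every term has a partner
        terms = [terms[i] + terms[i + 1] * power for i in range(0, len(terms), 2)]
        power *= power  # x, x^2, x^4, ...
    return terms[0]


# Degree-10 Taylor coefficients of e^x around 0 (illustrative choice).
taylor = [1.0 / math.factorial(k) for k in range(11)]
x = 0.5
print(horner(taylor, x), estrin(taylor, x), math.exp(x))
```

Python itself will not show the speed difference; the point of the sketch is the shape of the dependency graph, which is what matters once the same schemes are written in C or assembly.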
Citations: 5
BiqBin: A Parallel Branch-and-bound Solver for Binary Quadratic Problems with Linear Constraints
Pub Date : 2020-09-14 DOI: 10.1145/3514039
Nicoló Gusmeroli, T. Hrga, Borut Lužar, J. Povh, Melanie Siebenhofer, Angelika Wiegele
We present BiqBin, an exact solver for linearly constrained binary quadratic problems. Our approach is based on an exact penalty method to first efficiently transform the original problem into an instance of Max-Cut, and then to solve the Max-Cut problem by a branch-and-bound algorithm. All the main ingredients are carefully developed using new semidefinite programming relaxations obtained by strengthening the existing relaxations with a set of hypermetric inequalities, applying the bundle method as the bounding routine and using new strategies for exploring the branch-and-bound tree. Furthermore, an efficient C implementation of a sequential and a parallel branch-and-bound algorithm is presented. The latter is based on a load coordinator-worker scheme using MPI for multi-node parallelization and is evaluated on a high-performance computer. The new solver is benchmarked against BiqCrunch, GUROBI, and SCIP on four families of (linearly constrained) binary quadratic problems. Numerical results demonstrate that BiqBin is a highly competitive solver. The serial version outperforms the other three solvers on the majority of the benchmark instances. We also evaluate the parallel solver and show that it has good scaling properties. The general audience can use it as an on-line service available at http://www.biqbin.eu.
Citations: 7
Algorithms for Efficient Reproducible Floating Point Summation
Pub Date : 2020-07-21 DOI: 10.1145/3389360
Peter Ahrens, J. Demmel, Hong Diep Nguyen
We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociative floating point addition, makes reproducibility challenging even for summation, or operations like the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE Floating Point Standard 754-2008 and bitwise operations on the standard representations in memory. Our approach requires only one read-only pass over the data, and one reduction in parallel, using a 6-word reproducible accumulator (more words can be used for higher accuracy), enabling standard tiling optimization techniques. Summing n words with a 6-word reproducible accumulator requires approximately 9n floating point operations (arithmetic, comparison, and absolute value) and approximately 3n bitwise operations. The final error bound with a 6-word reproducible accumulator and our default settings can be up to 2^29 times smaller than the error bound for conventional (recursive) summation on ill-conditioned double-precision inputs.
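The underlying problem is easy to reproduce: because floating-point addition is not associative, the same numbers summed in a different order can give different results. The Python sketch below sums one small, ill-conditioned list in every possible order with a plain left-to-right loop, then repeats the experiment with `math.fsum` from the standard library. Note that `math.fsum` is a stand-in chosen for illustration: it is order-independent because it returns the correctly rounded exact sum, which is a different technique from the fixed-size binned accumulator described above.

```python
import itertools
import math


def naive_sum(values):
    """Plain left-to-right (recursive) summation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Ill-conditioned input: large cancelling terms hide the small ones.
values = [1e16, 1.0, -1e16, 1.0]

naive_results = {naive_sum(p) for p in itertools.permutations(values)}
exact_results = {math.fsum(p) for p in itertools.permutations(values)}

print("naive results by order:", sorted(naive_results))
print("fsum results by order: ", sorted(exact_results))
```

The explicit loop is deliberate: recent versions of Python's built-in `sum` apply compensated summation to floats, which would mask the effect being shown.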
我们将“可重复性”定义为从同一程序的多次运行中获得按位相同的结果,可能使用不同的硬件资源或其他不应影响答案的更改。许多用户依赖于可再现性来进行调试或正确性。但是,并行计算资源的动态调度与非关联浮点加法相结合,即使对于求和或BLAS之类的操作,也会使再现性受到挑战。我们描述了一个“可重复累加器”数据结构(“二进制数”)和相关的算法来可重复地和二进制浮点数,与求和顺序无关。我们使用IEEE浮点标准754-2008的一个子集,并对内存中的标准表示进行位操作。我们的方法只需要对数据进行一次只读传递,并使用6个单词的可重复累加器(更多的单词可以用于更高的精度)并行地进行一次缩减,从而支持标准的平铺优化技术。使用6字可重复累加器对n个单词求和需要大约9n个浮点操作(算术、比较和绝对值)和大约3n个位操作。使用6字可重复累加器和我们的默认设置的最终误差界可以比在条件恶劣的双精度输入上进行传统(递归)求和的误差界小229倍。
Citations: 9