ACM Transactions on Mathematical Software最新文献

英文中文

Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors 在英特尔、AMD和富士通处理器上批量、小矩阵和矩形矩阵乘法的缓存优化和性能建模

1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-09-19 DOI: 10.1145/3595178

Sameer Deshmukh, Rio Yokota, George Bosilca

Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs using diverse ISAs – the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes.

密集矩阵和张量的因式分解和乘法是科学工具箱中至关重要但又极其昂贵的部分。小心地使用低秩近似可以大大减少这些操作的计算和内存需求。除了较低的算术复杂度外，这些方法还可以通过其结构设计来有效地利用现代硬件体系结构。现有的大部分工作依赖于批处理的BLAS库来处理许多小的密集矩阵的计算。我们表明，通过仔细分析缓存利用率、使用SIMD寄存器的寄存器积累以及重新设计实现，可以在大范围的块和批大小中为这些类型的批处理低秩矩阵实现更高的吞吐量。我们使用不同的isa在三个CPU上测试我们的算法-使用ARM SVE的富士通A64FX，使用AVX-512的英特尔至强6148和使用AVX-2的AMD EPYC 7502，并表明我们的新批处理方法能够获得超过两倍的供应商优化库的吞吐量，适用于所有CPU架构和问题大小。

{"title":"Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors","authors":"Sameer Deshmukh, Rio Yokota, George Bosilca","doi":"10.1145/3595178","DOIUrl":"https://doi.org/10.1145/3595178","url":null,"abstract":"Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs using diverse ISAs – the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135059806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

New subspace method for unconstrained derivative-free optimization 无约束无导数优化的新子空间方法

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-09-02 DOI: 10.1145/3618297

M. Kimiaei, A. Neumaier, Parvaneh Faramarzi

This paper defines an efficient subspace method, called SSDFO, for unconstrained derivative-free optimization problems where the gradients of the objective function are Lipschitz continuous but only exact function values are available. SSDFO employs line searches along directions constructed on the basis of quadratic models. These approximate the objective function in a subspace spanned by some previous search directions. A worst case complexity bound on the number of iterations and function evaluations is derived for a basic algorithm using this technique. Numerical results for a practical variant with additional heuristic features show that, on the unconstrained CUTEst test problems, SSDFO has superior performance compared to the best solvers from the literature.

本文针对目标函数的梯度为Lipschitz连续但只有精确函数值的无约束无导数优化问题，定义了一种有效的子空间方法SSDFO。SSDFO采用基于二次模型构建的方向进行直线搜索。这些近似的目标函数在一个子空间由一些先前的搜索方向。对于使用该技术的基本算法，导出了迭代次数和函数求值的最坏情况复杂度界限。对具有附加启发式特征的实际变体的数值结果表明，在无约束CUTEst测试问题上，SSDFO与文献中的最佳求解器相比具有优越的性能。

引用次数: 1

IEEE-754 precision-p base-β arithmetic implemented in binary 用二进制实现的IEEE-754精度-p基-β算术

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-08-21 DOI: 10.1145/3596218

S. Rump

We show how an IEEE-754 conformant precision-p base-β arithmetic can be implemented based on some binary floating-point and/or integer arithmetic. This includes the four basic operations and square root subject to the five IEEE-754 rounding modes, namely the nearest roundings with roundTiesToEven and roundTiesToAway, the directed roundings downwards and upwards, as well as rounding towards zero. Exceptional values like ∞ of NaN are covered according to the IEEE-754 arithmetic standard. The results of the precision-p base-β operations are computed using some underlying precision-q binary arithmetic. We distinguish two cases. When using a precision-q binary integer arithmetic, the base-β precision p is limited for all operations by β2p ≤ 2q, whereas using a precision-q binary floating-point arithmetic imposes stronger limits on the base-β precision, namely β2p ≤ 2q for addition and multiplication, β2p ≤ 2q − 1 for division and β2p ≤ 2q − 3 for the square root. Those limitations cannot be improved. The algorithms are implemented in a Matlab/Octave flbeta-toolbox with the choice of using uint64 or binary64 as underlying arithmetic. The former allows larger precisions, the latter is advantageous for the square root, whereas computing times are similar. The flbeta-toolbox offers precision-p base-β scalar, vector and matrix operations including sparse matrices as well as corresponding interval operations. The base β can be chosen in the range β ∈ [2, 64]. The flbeta-toolbox will be part of Version 13 of INTLAB [18], the Matlab/Octave toolbox for reliable computing.

我们展示了如何在一些二进制浮点和/或整数算法的基础上实现符合IEEE-754的精度-p基-β算法。这包括四个基本运算和服从五种IEEE-754舍入模式的平方根，即具有roundTiesToEven和roundTietToAway的最近舍入、向下和向上的定向舍入以及向零舍入。根据IEEE-754算法标准，NaN的∞等异常值被覆盖。精度-p基-β运算的结果是使用一些基本的精度-q二进制算法计算的。我们区分两种情况。当使用精度为q的二进制整数运算时，所有运算的基本β精度p都受到β2p≤2q的限制，而使用精度为q的二进制浮点运算对基本β精度施加了更强的限制，即加法和乘法的β2p≥2q，除法的β2pp≤2q−1，平方根的β2ps≤2q-3。这些限制是无法改善的。这些算法在Matlab/Octave flbeta工具箱中实现，可以选择使用uint64或binary64作为底层算法。前者允许更大的精度，后者有利于平方根，而计算时间相似。flbeta工具箱提供精确-p基-β标量、向量和矩阵运算，包括稀疏矩阵以及相应的区间运算。基β可以在β∈[2，64]的范围内选择。flbeta工具箱将是INTLAB[18]第13版的一部分，该版本是用于可靠计算的Matlab/Octave工具箱。

{"title":"IEEE-754 precision-p base-β arithmetic implemented in binary","authors":"S. Rump","doi":"10.1145/3596218","DOIUrl":"https://doi.org/10.1145/3596218","url":null,"abstract":"We show how an IEEE-754 conformant precision-p base-β arithmetic can be implemented based on some binary floating-point and/or integer arithmetic. This includes the four basic operations and square root subject to the five IEEE-754 rounding modes, namely the nearest roundings with roundTiesToEven and roundTiesToAway, the directed roundings downwards and upwards, as well as rounding towards zero. Exceptional values like ∞ of NaN are covered according to the IEEE-754 arithmetic standard. The results of the precision-p base-β operations are computed using some underlying precision-q binary arithmetic. We distinguish two cases. When using a precision-q binary integer arithmetic, the base-β precision p is limited for all operations by β2p ≤ 2q, whereas using a precision-q binary floating-point arithmetic imposes stronger limits on the base-β precision, namely β2p ≤ 2q for addition and multiplication, β2p ≤ 2q − 1 for division and β2p ≤ 2q − 3 for the square root. Those limitations cannot be improved. The algorithms are implemented in a Matlab/Octave flbeta-toolbox with the choice of using uint64 or binary64 as underlying arithmetic. The former allows larger precisions, the latter is advantageous for the square root, whereas computing times are similar. The flbeta-toolbox offers precision-p base-β scalar, vector and matrix operations including sparse matrices as well as corresponding interval operations. The base β can be chosen in the range β ∈ [2, 64]. The flbeta-toolbox will be part of Version 13 of INTLAB [18], the Matlab/Octave toolbox for reliable computing.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41531528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Algorithm xxxx: KCC: A MATLAB Package for K-means-based Consensus Clustering 算法xxxx: KCC:基于k均值的共识聚类的MATLAB包

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-08-15 DOI: 10.1145/3616011

Hao Lin, Hongfu Liu, Junjie Wu, Hong Li, Stephan Günnemann

Consensus clustering is gaining increasing attention for its high quality and robustness. In particular, K-means-based Consensus Clustering (KCC) converts the usual computationally expensive problem to a classic K-means clustering with generalized utility functions, bringing potentials for large-scale data clustering on different types of data. Despite KCC’s applicability and generalizability, implementing this method such as representing the binary data set in the K-means heuristic is challenging, and has seldom been discussed in prior work. To fill this gap, we present a MATLAB package, KCC, that completely implements the KCC framework, and utilizes a sparse representation technique to achieve a low space complexity. Compared to alternative consensus clustering packages, the KCC package is of high flexibility, efficiency, and effectiveness. Extensive numerical experiments are also included to show its usability on real-world data sets.

一致性聚类以其高质量和鲁棒性而日益受到关注。特别是，基于K-means的一致性聚类（KCC）将通常计算成本高昂的问题转化为具有广义效用函数的经典K-means聚类，为在不同类型的数据上进行大规模数据聚类带来了潜力。尽管KCC具有适用性和可推广性，但实现这种方法（如在K-means启发式中表示二进制数据集）是具有挑战性的，并且在以前的工作中很少讨论。为了填补这一空白，我们提出了一个MATLAB包KCC，它完全实现了KCC框架，并利用稀疏表示技术来实现低空间复杂度。与其他共识集群包相比，KCC包具有很高的灵活性、效率和有效性。还包括大量的数值实验，以显示其在真实世界数据集上的可用性。

引用次数: 1

Sparse Approximate Multifrontal Factorization with Composite Compression Methods 复合压缩方法的稀疏近似多前沿因子分解

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-08-01 DOI: 10.1145/3611662

Lisa Claus, P. Ghysels, Yang Liu, T. Nhan, R. Thirumalaisamy, A. Bhalla, Sherry Li

This article presents a fast and approximate multifrontal solver for large sparse linear systems. In a recent work by Liu et al., we showed the efficiency of a multifrontal solver leveraging the butterfly algorithm and its hierarchical matrix extension, HODBF (hierarchical off-diagonal butterfly) compression to compress large frontal matrices. The resulting multifrontal solver can attain quasi-linear computation and memory complexity when applied to sparse linear systems arising from spatial discretization of high-frequency wave equations. To further reduce the overall number of operations and especially the factorization memory usage to scale to larger problem sizes, in this article we develop a composite multifrontal solver that employs the HODBF format for large-sized fronts, a reduced-memory version of the nonhierarchical block low-rank format for medium-sized fronts, and a lossy compression format for small-sized fronts. This allows us to solve sparse linear systems of dimension up to 2.7 × larger than before and leads to a memory consumption that is reduced by 70% while ensuring the same execution time. The code is made publicly available in GitHub.

本文提出了一种适用于大型稀疏线性系统的快速近似多平面解算器。在刘等人最近的一项工作中，我们展示了利用蝶形算法及其分层矩阵扩展HODBF（分层非对角蝶形）压缩来压缩大的前沿矩阵的多前沿求解器的效率。当应用于由高频波动方程的空间离散化引起的稀疏线性系统时，所得到的多平面解算器可以获得准线性计算和记忆复杂性。为了进一步减少操作的总数，特别是因子分解内存的使用，以扩展到更大的问题大小，在本文中，我们开发了一种复合多前沿求解器，该求解器对大尺寸前沿采用HODBF格式，对中型前沿采用非分层块低秩格式的缩减内存版本，对小型前沿采用有损压缩格式。这使我们能够解决比以前大2.7倍的稀疏线性系统，并在确保相同执行时间的同时减少70%的内存消耗。该代码在GitHub中公开。

{"title":"Sparse Approximate Multifrontal Factorization with Composite Compression Methods","authors":"Lisa Claus, P. Ghysels, Yang Liu, T. Nhan, R. Thirumalaisamy, A. Bhalla, Sherry Li","doi":"10.1145/3611662","DOIUrl":"https://doi.org/10.1145/3611662","url":null,"abstract":"This article presents a fast and approximate multifrontal solver for large sparse linear systems. In a recent work by Liu et al., we showed the efficiency of a multifrontal solver leveraging the butterfly algorithm and its hierarchical matrix extension, HODBF (hierarchical off-diagonal butterfly) compression to compress large frontal matrices. The resulting multifrontal solver can attain quasi-linear computation and memory complexity when applied to sparse linear systems arising from spatial discretization of high-frequency wave equations. To further reduce the overall number of operations and especially the factorization memory usage to scale to larger problem sizes, in this article we develop a composite multifrontal solver that employs the HODBF format for large-sized fronts, a reduced-memory version of the nonhierarchical block low-rank format for medium-sized fronts, and a lossy compression format for small-sized fronts. This allows us to solve sparse linear systems of dimension up to 2.7 × larger than before and leads to a memory consumption that is reduced by 70% while ensuring the same execution time. The code is made publicly available in GitHub.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"49 1","pages":"1 - 28"},"PeriodicalIF":2.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45941947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

emgr – EMpirical GRamian Framework Version 5.99 emgr -经验语法框架版本5.99

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-07-20 DOI: https://dl.acm.org/doi/10.1145/3609860

Christian Himpe

Version 5.99 of the empirical Gramian framework – emgr – completes a development cycle which focused on parametric model order reduction of gas network models while preserving compatibility to the previous development for the application of combined state and parameter reduction for neuroscience network models. Secondarily, new features concerning empirical Gramian types, perturbation design, and trajectory post-processing, as well as a Python version in addition to the default MATLAB / Octave implementation, have been added. This work summarizes these changes, particularly since emgr version 5.4, see Himpe, 2018 [Algorithms 11(7): 91], and gives recent as well as future applications, such as parameter identification in systems biology, based on the current feature set.

经验Gramian框架的5.99版本- emgr -完成了一个开发周期，重点是气体网络模型的参数模型降阶，同时保留了对神经科学网络模型组合状态和参数降阶应用的先前开发的兼容性。其次，添加了有关经验Gramian类型，摄动设计和轨迹后处理的新功能，以及默认MATLAB / Octave实现之外的Python版本。这项工作总结了这些变化，特别是自emgr 5.4版本以来，参见Himpe, 2018[算法11(7):91]，并给出了最近和未来的应用，例如基于当前特征集的系统生物学中的参数识别。

引用次数: 0

IFISS3D: A computational laboratory for investigating finite element approximation in three dimensions IFISS3D:一个用于研究三维有限元近似的计算实验室

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-06-20 DOI: https://dl.acm.org/doi/10.1145/3604934

Georgios Papanikos, Catherine E. Powell, David J. Silvester

IFISS is an established MATLAB finite element software package for studying strategies for solving partial differential equations (PDEs). IFISS3D is a new add-on toolbox that extends IFISS capabilities for elliptic PDEs from two to three space dimensions. The open-source MATLAB framework provides a computational laboratory for experimentation and exploration of finite element approximation and error estimation, as well as iterative solvers. The package is designed to be useful as a teaching tool for instructors and students who want to learn about state-of-the-art finite element methodology. It will also be useful for researchers as a source of reproducible test matrices of arbitrarily large dimension.

IFISS是一个成熟的MATLAB有限元软件包，用于研究偏微分方程(PDEs)求解策略。IFISS3D是一个新的附加工具箱，它将IFISS功能扩展到椭圆偏微分方程的二维到三维空间。开源的MATLAB框架为有限元逼近和误差估计以及迭代求解提供了实验和探索的计算实验室。该软件包的设计是有用的教学工具，为教师和学生谁想要了解国家的最先进的有限元方法。对于研究人员来说，它也可以作为任意大尺寸可重复测试矩阵的来源。

引用次数: 0

Approximating inverse cumulative distribution functions to produce approximate random variables 近似逆累积分布函数以产生近似随机变量

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-06-17 DOI: https://dl.acm.org/doi/10.1145/3604935

Michael Giles, Oliver Sheridan-Methven

For random variables produced through the inverse transform method, approximate random variables are introduced, which are produced using approximations to a distribution’s inverse cumulative distribution function. These approximations are designed to be computationally inexpensive, and much cheaper than library functions which are exact to within machine precision, and thus highly suitable for use in Monte Carlo simulations. The approximation errors they introduce can then be eliminated through use of the multilevel Monte Carlo method. Two approximations are presented for the Gaussian distribution: a piecewise constant on equally spaced intervals, and a piecewise linear using geometrically decaying intervals. The errors of the approximations are bounded and the convergence demonstrated, and the computational savings measured for C and C++ implementations. Implementations tailored for Intel and Arm hardware are inspected, alongside hardware agnostic implementations built using OpenMP. The savings are incorporated into a nested multilevel Monte Carlo framework with the Euler-Maruyama scheme to exploit the speed ups without losing accuracy, offering speed ups by a factor of 5–7. These ideas are empirically extended to the Milstein scheme, and the non-central χ² distribution for the Cox-Ingersoll-Ross process, offering speed ups of a factor of 250 or more.

对于通过逆变换方法产生的随机变量，引入近似随机变量，近似随机变量是通过近似分布的逆累积分布函数产生的。这些近似被设计成计算成本低廉，比精确到机器精度的库函数便宜得多，因此非常适合在蒙特卡罗模拟中使用。它们引入的近似误差可以通过使用多层蒙特卡罗方法来消除。提出了高斯分布的两种近似:在等间隔上的分段常数和在几何衰减间隔上的分段线性。近似的误差是有限的，并且证明了收敛性，并且测量了C和c++实现的计算节省。为英特尔和Arm硬件量身定制的实现，以及使用OpenMP构建的硬件不可知实现进行了检查。节省的费用与欧拉-丸山方案合并到嵌套的多层蒙特卡罗框架中，在不损失精度的情况下利用加速，提供5-7倍的速度提升。这些想法在经验上被扩展到米尔斯坦方案，以及Cox-Ingersoll-Ross过程的非中心χ2分布，提供了250倍或更多的加速因子。

{"title":"Approximating inverse cumulative distribution functions to produce approximate random variables","authors":"Michael Giles, Oliver Sheridan-Methven","doi":"https://dl.acm.org/doi/10.1145/3604935","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3604935","url":null,"abstract":"For random variables produced through the inverse transform method, approximate random variables are introduced, which are produced using approximations to a distribution’s inverse cumulative distribution function. These approximations are designed to be computationally inexpensive, and much cheaper than library functions which are exact to within machine precision, and thus highly suitable for use in Monte Carlo simulations. The approximation errors they introduce can then be eliminated through use of the multilevel Monte Carlo method. Two approximations are presented for the Gaussian distribution: a piecewise constant on equally spaced intervals, and a piecewise linear using geometrically decaying intervals. The errors of the approximations are bounded and the convergence demonstrated, and the computational savings measured for C and C++ implementations. Implementations tailored for Intel and Arm hardware are inspected, alongside hardware agnostic implementations built using OpenMP. The savings are incorporated into a nested multilevel Monte Carlo framework with the Euler-Maruyama scheme to exploit the speed ups without losing accuracy, offering speed ups by a factor of 5–7. These ideas are empirically extended to the Milstein scheme, and the non-central χ2 distribution for the Cox-Ingersoll-Ross process, offering speed ups of a factor of 250 or more.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"86 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138537789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CPFloat: A C Library for Simulating Low-precision Arithmetic 一个模拟低精度算术的C语言库

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-06-17 DOI: https://dl.acm.org/doi/10.1145/3585515

Massimiliano Fasi, Mantas Mikaitis

One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.

可以通过软件模拟低精度浮点运算，方法是在硬件中执行每个算术运算，然后将结果四舍五入到所需的有效位数。对于符合ieee的格式，舍入只需要标准的数学库函数，但是处理次法线、下溢和溢出需要特别注意，并且数值错误可能导致数学上正确的公式在有限算术中表现不正确。此外，随后的实现不一定是高效的，因为构建这些技术的库函数通常是为处理广泛的情况而设计的，可能没有针对舍入算法的特定需求进行优化。CPFloat是一个用于模拟低精度算术的C库。它为舍入、执行数学计算和查询模拟低精度格式的属性提供了有效的例程。该软件利用存储数字的格式的位级浮点表示，并用低级位操作和整数运算取代昂贵的库调用。在数值实验中，与现有的C、c++和MATLAB替代方案相比，新技术带来了相当大的加速(通常是一个数量级或更多)。据我们所知，CPFloat是目前用于实验自定义低精度浮点算法的最有效和最完整的库。

{"title":"CPFloat: A C Library for Simulating Low-precision Arithmetic","authors":"Massimiliano Fasi, Mantas Mikaitis","doi":"https://dl.acm.org/doi/10.1145/3585515","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3585515","url":null,"abstract":"One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"69 ","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Task-based Parallel Programming for Scalable Matrix Product Algorithms 基于任务的可扩展矩阵积算法并行编程

IF 2.7 1区数学 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-06-15 DOI: https://dl.acm.org/doi/10.1145/3583560

Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, Julien Herrmann, Antoine Jego

Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way.In increasingly larger, more heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features that are necessary to express in an elegant and compact way scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although, this work focuses on dense General Matrix Multiplication, the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive up to 32,768 CPU cores with state-of-the-art libraries and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not deal with this case because it implies other issues which are out of the scope of this work.

基于任务的编程模型已经成功地获得了高性能数学软件社区的兴趣，因为它们以一种高效和可移植的方式减轻了开发和实现分布式内存并行算法的部分负担。在越来越大、越来越异构的计算机集群中，这些模型似乎是维护和增强更复杂算法的一种方式。然而，基于任务的编程模型缺乏灵活性和功能，而这些灵活性和功能是以一种优雅而紧凑的方式表达依赖于高级通信模式的可扩展算法所必需的。我们展示了顺序任务流范式可以扩展到编写紧凑但高效和可扩展的线性代数计算例程。虽然这项工作的重点是密集的一般矩阵乘法，但所提出的特征可以实现更复杂的算法。我们将描述这些特性的实现以及由此产生的GEMM操作。最后，我们在两台同构超级计算机上进行了实验分析，结果表明，我们的方法在拥有最先进库的32,768个CPU内核的情况下具有竞争力，并且在某些问题维度上可能优于它们。虽然我们的代码可以直接使用gpu，但我们不处理这种情况，因为它暗示了超出本工作范围的其他问题。

{"title":"Task-based Parallel Programming for Scalable Matrix Product Algorithms","authors":"Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, Julien Herrmann, Antoine Jego","doi":"https://dl.acm.org/doi/10.1145/3583560","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3583560","url":null,"abstract":"Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way.In increasingly larger, more heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features that are necessary to express in an elegant and compact way scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although, this work focuses on dense General Matrix Multiplication, the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive up to 32,768 CPU cores with state-of-the-art libraries and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not deal with this case because it implies other issues which are out of the scope of this work.","PeriodicalId":50935,"journal":{"name":"ACM Transactions on Mathematical Software","volume":"63 ","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

ACM Transactions on Mathematical Software

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀