ACM Transactions on Mathematical Software: Latest Articles, Page 6

A Geometric Multigrid Method for Space-Time Finite Element Discretizations of the Navier–Stokes Equations and its Application to 3D Flow Simulation

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-03-21 DOI: https://dl.acm.org/doi/10.1145/3582492

Mathias Anselmann, Markus Bause

We present a parallelized geometric multigrid (GMG) method, based on the cell-based Vanka smoother, for higher-order space-time finite element methods (STFEM) for the incompressible Navier–Stokes equations. The STFEM is implemented as a time marching scheme. The GMG solver is applied as a preconditioner for generalized minimal residual iterations. Its performance properties are demonstrated for 2D and 3D benchmarks of flow around a cylinder. The key ingredients of the GMG approach are the construction of the local Vanka smoother over all degrees of freedom in time of the respective subinterval and its efficient application. For this, data structures are generated that store pre-computed cell inverses of the Jacobian for all hierarchical levels and require only a reasonable amount of memory overhead. The GMG method is built for the deal.II finite element library. The concepts are flexible and can be transferred to similar software platforms.

Citations: 0

Algorithm 1033: Parallel Implementations for Computing the Minimum Distance of a Random Linear Code on Distributed-memory Architectures

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-03-21 DOI: https://dl.acm.org/doi/10.1145/3573383

Gregorio Quintana-Ortí, Fernando Hernando, Francisco D. Igual

The minimum distance of a linear code is a key concept in information theory. Therefore, the time required by its computation is very important to many problems in this area. In this article, we introduce a family of implementations of the Brouwer–Zimmermann algorithm for distributed-memory architectures for computing the minimum distance of a random linear code over 𝔽₂. Both current commercial and public-domain software only work on either unicore architectures or shared-memory architectures, which are limited in the number of cores/processors employed in the computation. Our implementations focus on distributed-memory architectures, thus being able to employ hundreds or even thousands of cores in the computation of the minimum distance. Our experimental results show that our implementations are much faster, even up to several orders of magnitude, than current implementations widely used nowadays.
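To make the quantity being computed concrete, the following is a minimal sketch of a brute-force minimum-distance computation for a binary linear code. It is not the Brouwer–Zimmermann algorithm and not the authors' code; it simply enumerates every nonzero codeword, which is only feasible for very small dimensions. The function name `min_distance` and the example matrix `G_HAMMING` are illustrative choices, and the rows of the generator matrix are assumed to be linearly independent.

```python
from itertools import product

def min_distance(G):
    """Brute-force minimum distance of the binary code spanned by the rows of G.

    G is a list of k rows, each a list of n bits (0/1). The rows are assumed
    to be linearly independent. Cost is O(2**k * k * n).
    """
    k, n = len(G), len(G[0])
    best = n + 1
    for coeffs in product((0, 1), repeat=k):
        if not any(coeffs):
            continue  # skip the zero message, which gives the zero codeword
        word = [0] * n
        for c, row in zip(coeffs, G):
            if c:
                word = [w ^ r for w, r in zip(word, row)]  # addition over F_2
        best = min(best, sum(word))  # Hamming weight of this codeword
    return best

# Generator matrix of the [7,4] Hamming code in systematic form.
G_HAMMING = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

print(min_distance(G_HAMMING))
```

The exponential cost of this enumeration is what motivates smarter algorithms and, as in the article, spreading the work over many cores.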

Citations: 0

Algorithm 1032: Bi-cubic Splines for Polyhedral Control Nets

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-03-21 DOI: https://dl.acm.org/doi/10.1145/3570158

Jörg Peters, Kyle Lo, Kȩstutis Karčiauskas

For control nets outlining a large class of topological polyhedra, not just tensor-product grids, bi-cubic polyhedral splines form a piecewise polynomial, first-order differentiable space that associates one function with each vertex. Akin to tensor-product splines, the resulting smooth surface approximates the polyhedron. Admissible polyhedral control nets consist of quadrilateral faces in a grid-like layout, star-configuration where n ≠ 4 quadrilateral faces join around an interior vertex, n-gon configurations, where 2n quadrilaterals surround an n-gon, polar configurations where a cone of n triangles meeting at a vertex is surrounded by a ribbon of n quadrilaterals, and three types of T-junctions where two quad-strips merge into one.

The bi-cubic pieces of a polyhedral spline have matching derivatives along their break lines, possibly after a known change of variables. The pieces are represented in Bernstein-Bézier form with coefficients depending linearly on the polyhedral control net, so that evaluation, differentiation, integration, moments, and so on, are no more costly than for standard tensor-product splines. Bi-cubic polyhedral splines can be used both to model geometry and for computing functions on the geometry. Although polyhedral splines do not offer nested refinement by refinement of the control net, polyhedral splines support engineering analysis of curved smooth objects. Coarse nets typically suffice since the splines efficiently model curved features. Algorithm 1032 is a C++ library with input-output example pairs and an IGES output choice.
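As an illustration of why pieces in Bernstein-Bézier form are cheap to evaluate, here is a minimal sketch of tensor-product evaluation of a single bi-cubic patch by de Casteljau's algorithm. It is not part of Algorithm 1032 (which is a C++ library); the function names and the 4x4 scalar coefficient layout are assumptions made for this example, and the step from a polyhedral control net to the patch coefficients is not shown.

```python
def de_casteljau(coeffs, t):
    """Evaluate a polynomial given by Bernstein-Bezier coefficients at t
    using repeated linear interpolation."""
    pts = list(coeffs)
    while len(pts) > 1:
        pts = [(1 - t) * a + t * b for a, b in zip(pts, pts[1:])]
    return pts[0]

def eval_bicubic(patch, u, v):
    """Evaluate a bi-cubic patch at (u, v).

    patch is a 4x4 grid of scalar coefficients; patch[i][j] multiplies
    B_i(u) * B_j(v), where B_i are the cubic Bernstein polynomials.
    """
    # Collapse each row in the v direction, then the resulting column in u.
    return de_casteljau([de_casteljau(row, v) for row in patch], u)

# Coefficients i/3 + 2*j/3 reproduce the linear function u + 2*v.
PATCH = [[i / 3 + 2 * j / 3 for j in range(4)] for i in range(4)]
print(eval_bicubic(PATCH, 0.25, 0.5))
```

For surface points, the same routine would be applied once per coordinate (or to vector-valued coefficients).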

Citations: 0

Enabling Research through the SCIP Optimization Suite 8.0

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-03-10 DOI: 10.1145/3585516

Ksenia Bestuzheva, Mathieu Besançon, Weikun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, L. Eifler, Oliver Gaul, Gerald Gamrath, A. Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, A. Hoen, Christopher Hojny, R. V. D. Hulst, T. Koch, M. Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, M. Pfetsch, D. Rehfeldt, Steffan Schlein, Franziska Schlösser, Felipe Serrano, Y. Shinano, Boro Sofranac, Mark Turner, Stefan Vigerske, Fabian Wegscheider, Philip A. Wellner, Dieter Weninger, Jakob Witzig

The SCIP Optimization Suite provides a collection of software packages for mathematical optimization centered around the constraint integer programming framework SCIP. The focus of this article is on the role of the SCIP Optimization Suite in supporting research. SCIP’s main design principles are discussed, followed by a presentation of the latest performance improvements and developments in version 8.0, which serve both as examples of SCIP’s application as a research tool and as a platform for further developments. Furthermore, this article gives an overview of interfaces to other programming and modeling languages, new features that expand the possibilities for user interaction with the framework, and the latest developments in several extensions built upon SCIP.

Volume 49, Issue 1, Pages 1-21

Citations: 15

Algorithm 1036: ATC, An Advanced Tucker Compression Library for Multidimensional Data

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-03-01 DOI: 10.1145/3585514

Wouter Baert, N. Vannieuwenhoven

We present ATC, a C++ library for advanced Tucker-based lossy compression of dense multidimensional numerical data in a shared-memory parallel setting, based on the sequentially truncated higher-order singular value decomposition (ST-HOSVD) and bit plane truncation. Several techniques are proposed to improve speed, memory usage, error control and compression rate. First, a hybrid truncation scheme is described which combines Tucker rank truncation and TTHRESH quantization. We derive a novel expression to approximate the error of truncated Tucker decompositions in the case of core and factor perturbations. We parallelize the quantization and encoding scheme and adjust this phase to improve error control. Implementation aspects are described, such as an ST-HOSVD procedure using only a single transposition. We also discuss several usability features of ATC, including the presence of multiple interfaces, extensive data type support, and integrated downsampling of the decompressed data. Numerical results show that ATC maintains state-of-the-art Tucker compression rates while providing average speed-up factors of 2.2 to 3.5 and halving memory usage. Our compressor provides precise error control, deviating only 1.4% from the requested error on average. Finally, ATC often achieves higher compression than non-Tucker-based compressors in the high-error domain.

Volume 49, Issue 1, Pages 1-25

Citations: 2

CPFloat: A C Library for Simulating Low-precision Arithmetic

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-02-25 DOI: 10.1145/3585515

M. Fasi, M. Mikaitis

One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.
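The baseline approach mentioned at the start of the abstract (compute in hardware precision, then round to the target number of significant bits using standard library functions) can be sketched as follows. This is not CPFloat's API and not its bit-manipulation technique; the function name `round_to_precision` is hypothetical. The sketch only handles the significand: it deliberately ignores the exponent range of the target format, so subnormals, underflow, and overflow (the cases the abstract flags as needing special attention) are not modeled.

```python
import math

def round_to_precision(x, t):
    """Round the binary64 value x to t significant bits (1 <= t <= 53),
    round-to-nearest with ties to even.

    Assumption: the target format has an unbounded exponent range, so
    subnormals, underflow, and overflow are not simulated.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scaled = math.ldexp(m, t)       # exact: |scaled| lies in [2**(t-1), 2**t)
    # Python's round() on a float rounds half to even and returns an int.
    return math.ldexp(round(scaled), e - t)

# 8 significant bits corresponds to the bfloat16 significand,
# 11 to the binary16 (half precision) significand.
print(round_to_precision(0.1, 8))
print(round_to_precision(1.0 + 2.0**-11, 11))
```

A simulated operation is then, for example, `round_to_precision(a + b, t)`, with the addition itself carried out in binary64.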

Volume 49, Issue 1, Pages 1-32

Citations: 7

Task-based Parallel Programming for Scalable Matrix Product Algorithms

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-02-24 DOI: 10.1145/3583560

E. Agullo, A. Buttari, A. Guermouche, J. Herrmann, Antoine Jego

Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way. In increasingly larger, more heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features that are necessary to express in an elegant and compact way scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although this work focuses on dense General Matrix Multiplication, the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive up to 32,768 CPU cores with state-of-the-art libraries and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not deal with this case because it implies other issues which are out of the scope of this work.
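The basic Sequential Task Flow idea (tasks are submitted in program order, and the runtime infers dependencies from which data each task reads and writes) can be illustrated with a toy shared-memory sketch. This is not the authors' runtime or their extensions for distributed memory; the class `SequentialTaskFlow`, its `submit` signature, and the tile handles are all hypothetical names invented for this example. It relies on `ThreadPoolExecutor` starting tasks in submission order, so a task only ever waits on tasks that were started before it.

```python
from concurrent.futures import ThreadPoolExecutor

class SequentialTaskFlow:
    """Toy STF runtime: dependencies are inferred from read/write handles."""

    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.last_writer = {}  # handle -> future of the last task writing it
        self.readers = {}      # handle -> futures reading it since that write

    def submit(self, fn, reads=(), writes=()):
        # Read-after-write dependencies.
        deps = [self.last_writer[h] for h in reads if h in self.last_writer]
        for h in writes:
            if h in self.last_writer:
                deps.append(self.last_writer[h])   # write-after-write
            deps.extend(self.readers.get(h, []))   # write-after-read

        def run():
            for d in deps:
                d.result()  # block until every predecessor has finished
            fn()

        fut = self.pool.submit(run)
        for h in reads:
            self.readers.setdefault(h, []).append(fut)
        for h in writes:
            self.last_writer[h] = fut
            self.readers[h] = []
        return fut

    def wait_all(self):
        self.pool.shutdown(wait=True)

def tiled_gemm(A, B, n, bs, workers=4):
    """Return C = A @ B for n x n lists of lists, using bs x bs tiles
    (bs is assumed to divide n)."""
    C = [[0.0] * n for _ in range(n)]
    nt = n // bs
    stf = SequentialTaskFlow(workers)

    def gemm_tile(i, j, k):
        def task():  # C_ij += A_ik * B_kj on one tile
            for r in range(i * bs, (i + 1) * bs):
                for c in range(j * bs, (j + 1) * bs):
                    C[r][c] += sum(A[r][p] * B[p][c]
                                   for p in range(k * bs, (k + 1) * bs))
        return task

    # The loop nest reads like sequential code; updates to the same C tile
    # are serialized by the inferred dependencies, other tiles run in parallel.
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                stf.submit(gemm_tile(i, j, k),
                           reads=[("A", i, k), ("B", k, j)],
                           writes=[("C", i, j)])
    stf.wait_all()
    return C
```

The sketch says nothing about communication patterns across nodes, which is the part the article actually addresses.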

Volume 49, Issue 1, Pages 1-23

Citations: 0

Algorithm xxx: Encapsulated error, a direct approach to evaluate floating-point accuracy

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-02-17 DOI: https://dl.acm.org/doi/10.1145/3549205

Nestor Demeure, Cédric Chevalier, Christophe Denis, Pierre Dossantos-Uzarralde

Floating-point numbers represent only a subset of real numbers. As such, floating-point arithmetic introduces approximations that can compound and have a significant impact on numerical simulations. We introduce Encapsulated error, a new way to estimate the numerical error of an application and provide a reference implementation, the Shaman library. Our method uses dedicated arithmetic over a type that encapsulates both the result the user would have had with the original computation and an approximation of its numerical error. We thus can measure the number of significant digits of any result or intermediate result in a simulation. We show that this approach, while simple, gives results competitive with state-of-the-art methods. It has a smaller overhead, and it is compatible with parallelism, making it suitable for the study of large-scale applications.
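The idea of a type that carries both the ordinary floating-point result and an estimate of its numerical error can be sketched as below. This is not the Shaman library's interface; the class name `EFloat` and its methods are hypothetical. The sketch assumes the per-operation rounding error is obtained with the TwoSum error-free transformation for addition and with exact rational arithmetic for multiplication, which the abstract does not specify; error terms are propagated to first order only.

```python
import math
from fractions import Fraction

class EFloat:
    """A binary64 value paired with an estimate of its accumulated error,
    with the convention that value + error approximates the exact result."""

    def __init__(self, value, error=0.0):
        self.value = float(value)
        self.error = float(error)

    def __add__(self, other):
        a, b = self.value, other.value
        s = a + b
        # TwoSum: 'local' is the exact rounding error of a + b (no overflow).
        bb = s - a
        local = (a - (s - bb)) + (b - bb)
        return EFloat(s, self.error + other.error + local)

    def __mul__(self, other):
        a, b = self.value, other.value
        p = a * b
        # Rounding error of the product, computed exactly with rationals
        # (assumes finite operands and a finite product).
        local = float(Fraction(a) * Fraction(b) - Fraction(p))
        # First-order propagation of the operands' errors.
        return EFloat(p, a * other.error + b * self.error + local)

    def digits(self):
        """Estimated number of significant decimal digits."""
        if self.error == 0.0:
            return math.inf
        if self.value == 0.0:
            return 0.0
        return max(0.0, -math.log10(abs(self.error / self.value)))

c = EFloat(1e16) + EFloat(1.0)   # the 1.0 is lost to rounding in binary64
print(c.value, c.error, c.digits())
```

The `value` field is exactly what the unmodified computation would have produced, which matches the abstract's description of the encapsulated result.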

Citations: 0

Combining Sparse Approximate Factorizations with Mixed-precision Iterative Refinement

IF 2.7 | Zone 1 (Mathematics) | Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Mathematical Software

Pub Date : 2023-02-06 DOI: 10.1145/3582493

P. Amestoy, A. Buttari, N. Higham, J. L’Excellent, Théo Mary, Bastien Vieublé

The standard LU factorization-based solution process for linear systems can be enhanced in speed or accuracy by employing mixed-precision iterative refinement. Most recent work has focused on dense systems. We investigate the potential of mixed-precision iterative refinement to enhance methods for sparse systems based on approximate sparse factorizations. In doing so, we first develop a new error analysis for LU- and GMRES-based iterative refinement under a general model of LU factorization that accounts for the approximation methods typically used by modern sparse solvers, such as low-rank approximations or relaxed pivoting strategies. We then provide a detailed performance analysis of both the execution time and memory consumption of different algorithms, based on a selected set of iterative refinement variants and approximate sparse factorizations. Our performance study uses the multifrontal solver MUMPS, which can exploit block low-rank factorization and static pivoting. We evaluate the performance of the algorithms on large, sparse problems coming from a variety of real-life and industrial applications showing that mixed-precision iterative refinement combined with approximate sparse factorization can lead to considerable reductions of both the time and memory consumption.
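A minimal dense sketch of LU-based mixed-precision iterative refinement may help fix ideas: the factorization is computed in simulated binary32, while residuals and solution updates are computed in binary64. This is not the authors' method for sparse systems and does not involve MUMPS, GMRES, low-rank approximation, or pivoting strategies; the helper names (`f32`, `lu_f32`, `lu_solve`, `refine`) are hypothetical. As a simplification, the triangular solves are carried out in binary64 using the low-precision factors, and no pivoting is done, so the example assumes a matrix for which that is safe (for instance, a diagonally dominant one).

```python
import struct

def f32(x):
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_f32(A):
    """LU factorization without pivoting, rounding every operation to
    binary32. Returns L (unit lower, below the diagonal) and U packed
    in one matrix."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU

def lu_solve(LU, b):
    """Solve L U x = b by forward and backward substitution (in binary64)."""
    n = len(b)
    y = list(b)
    for i in range(n):                 # forward substitution, unit diagonal
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):       # backward substitution
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def refine(A, b, iters=5):
    """Iterative refinement: low-precision factors, binary64 residuals."""
    LU = lu_f32(A)
    x = lu_solve(LU, b)
    for _ in range(iters):
        r = [bi - sum(a * xj for a, xj in zip(row, x))
             for row, bi in zip(A, b)]          # residual b - A x
        d = lu_solve(LU, r)                     # correction from cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.1, 1.0, 2.0], [1.0, 5.3, 1.0], [2.0, 1.0, 6.7]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
print(refine(A, b))
```

The expensive factorization is done once in low precision and reused in every refinement step, which is the source of the time and memory savings the abstract refers to.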

Volume 49, Issue 1, Pages 1-29

Citations: 7