MieSolver provides an efficient solver for the problem of wave propagation through a heterogeneous configuration of nonidentical circular cylinders. MieSolver allows great flexibility in the physical properties of each cylinder, and the cylinders may have opaque or penetrable cores, as well as an arbitrary number of penetrable layers. The wave propagation is governed by the two-dimensional Helmholtz equation and models electromagnetic, acoustic, and elastic waves. The solver is based on the Mie series solution for scattering by a single circular cylinder and hence is numerically stable and highly accurate. We demonstrate the accuracy of our software with extensive numerical experiments over a wide range of frequencies (about five orders of magnitude) and up to 60 cylinders.
{"title":"Algorithm 1009","authors":"S. Hawkins","doi":"10.1145/3381537","DOIUrl":"https://doi.org/10.1145/3381537","url":null,"abstract":"MieSolver provides an efficient solver for the problem of wave propagation through a heterogeneous configuration of nonidentical circular cylinders. MieSolver allows great flexibility in the physical properties of each cylinder, and the cylinders may have opaque or penetrable cores, as well as an arbitrary number of penetrable layers. The wave propagation is governed by the two-dimensional Helmholtz equation and models electromagnetic, acoustic, and elastic waves. The solver is based on the Mie series solution for scattering by a single circular cylinder and hence is numerically stable and highly accurate. We demonstrate the accuracy of our software with extensive numerical experiments over a wide range of frequencies (about five orders of magnitude) and up to 60 cylinders.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"9 1","pages":"1 - 28"},"PeriodicalIF":0.0,"publicationDate":"2020-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75165936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Abhyankar, G. Betrie, D. Maldonado, L. McInnes, Barry F. Smith, Hong Zhang
We present DMNetwork, a high-level package included in the PETSc library for the simulation of multiphysics phenomena over large-scale networked systems. The library aims at applications that have networked structures such as those in electrical, gas, and water distribution systems. DMNetwork provides data and topology management, parallelization for multiphysics systems over a network, and hierarchical and composable solvers to exploit the problem structure. DMNetwork eases the simulation development cycle by providing the necessary infrastructure through simple abstractions to define and query the network components. This article presents the design of DMNetwork, illustrates its user interface, and demonstrates its ability to solve multiphysics systems, such as an electric circuit, a network of power grid and water subnetworks, and transient hydraulic systems over large networks with more than 2 billion variables on extreme-scale computers using up to 30,000 processors.
{"title":"PETSc DMNetwork","authors":"S. Abhyankar, G. Betrie, D. Maldonado, L. McInnes, Barry F. Smith, Hong Zhang","doi":"10.1145/3344587","DOIUrl":"https://doi.org/10.1145/3344587","url":null,"abstract":"We present DMNetwork, a high-level package included in the PETSc library for the simulation of multiphysics phenomena over large-scale networked systems. The library aims at applications that have networked structures such as those in electrical, gas, and water distribution systems. DMNetwork provides data and topology management, parallelization for multiphysics systems over a network, and hierarchical and composable solvers to exploit the problem structure. DMNetwork eases the simulation development cycle by providing the necessary infrastructure through simple abstractions to define and query the network components. This article presents the design of DMNetwork, illustrates its user interface, and demonstrates its ability to solve multiphysics systems, such as an electric circuit, a network of power grid and water subnetworks, and transient hydraulic systems over large networks with more than 2 billion variables on extreme-scale computers using up to 30,000 processors.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"36 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2020-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81224257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present ShiftConvolvePoibin, a fast exact method to compute the tail of a Poisson-binomial distribution (PBD). Our method employs an exponential shift to retain its accuracy when computing a tail probability, and in practice we find that it is immune to the significant relative errors that other methods, exact or approximate, can suffer from when computing very small tail probabilities of the PBD. The accompanying R package is also competitive with the fastest implementations for computing the entire PBD.
{"title":"Exactly Computing the Tail of the Poisson-Binomial Distribution","authors":"N. Peres, Andrew Lee, U. Keich","doi":"10.1145/3460774","DOIUrl":"https://doi.org/10.1145/3460774","url":null,"abstract":"We present ShiftConvolvePoibin, a fast exact method to compute the tail of a Poisson-binomial distribution (PBD). Our method employs an exponential shift to retain its accuracy when computing a tail probability, and in practice we find that it is immune to the significant relative errors that other methods, exact or approximate, can suffer from when computing very small tail probabilities of the PBD. The accompanying R package is also competitive with the fastest implementations for computing the entire PBD.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"179 1","pages":"1 - 19"},"PeriodicalIF":0.0,"publicationDate":"2020-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80105643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Davis, W. Hager, Scott P. Kolodziej, S. Yeralan
Partitioning graphs is a common and useful operation in many areas, from parallel computing to VLSI design to sparse matrix algorithms. In this article, we introduce Mongoose, a multilevel hybrid graph partitioning algorithm and library. Building on previous work in multilevel partitioning frameworks and combinatoric approaches, we introduce novel stall-reducing and stall-free coarsening strategies, as well as an efficient hybrid algorithm leveraging (1) traditional combinatoric methods and (2) continuous quadratic programming formulations. We demonstrate how this new hybrid algorithm outperforms either strategy in isolation, and we also compare Mongoose to METIS and demonstrate its effectiveness on large and social networking (power law) graphs.
{"title":"Algorithm 1003","authors":"T. Davis, W. Hager, Scott P. Kolodziej, S. Yeralan","doi":"10.1145/3337792","DOIUrl":"https://doi.org/10.1145/3337792","url":null,"abstract":"Partitioning graphs is a common and useful operation in many areas, from parallel computing to VLSI design to sparse matrix algorithms. In this article, we introduce Mongoose, a multilevel hybrid graph partitioning algorithm and library. Building on previous work in multilevel partitioning frameworks and combinatoric approaches, we introduce novel stall-reducing and stall-free coarsening strategies, as well as an efficient hybrid algorithm leveraging (1) traditional combinatoric methods and (2) continuous quadratic programming formulations. We demonstrate how this new hybrid algorithm outperforms either strategy in isolation, and we also compare Mongoose to METIS and demonstrate its effectiveness on large and social networking (power law) graphs.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"45 1","pages":"1 - 18"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77674132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tao Cui, W. Leng, Huaqing Liu, Linbo Zhang, Weiying Zheng
Given a shape regular tetrahedron and a curved surface that is defined implicitly by a nonlinear level set function and divides the tetrahedron into two sub-domains, a general-purpose, robust, and high-order numerical algorithm is proposed in this article for computing both volume integrals in the sub-domains and surface integrals on their common boundary. The algorithm uses a direct approach that decomposes 3D volume integrals or 2D surface integrals into multiple 1D integrals and computes the 1D integrals with Gaussian quadratures. It only requires finding roots of univariate nonlinear functions in given intervals and evaluating the integrand, the level set function, and the gradient of the level set function at given points. It can achieve arbitrarily high accuracy by increasing the orders of Gaussian quadratures, and it does not need extra a priori knowledge about the integrand and the level set function. The code for the algorithm is freely available in the open-source finite element toolbox Parallel Hierarchical Grid (PHG) and can serve as a basic building block for implementing 3D high-order numerical algorithms involving implicit interfaces or boundaries.
{"title":"High-order Numerical Quadratures in a Tetrahedron with an Implicitly Defined Curved Interface","authors":"Tao Cui, W. Leng, Huaqing Liu, Linbo Zhang, Weiying Zheng","doi":"10.1145/3372144","DOIUrl":"https://doi.org/10.1145/3372144","url":null,"abstract":"Given a shape regular tetrahedron and a curved surface that is defined implicitly by a nonlinear level set function and divides the tetrahedron into two sub-domains, a general-purpose, robust, and high-order numerical algorithm is proposed in this article for computing both volume integrals in the sub-domains and surface integrals on their common boundary. The algorithm uses a direct approach that decomposes 3D volume integrals or 2D surface integrals into multiple 1D integrals and computes the 1D integrals with Gaussian quadratures. It only requires finding roots of univariate nonlinear functions in given intervals and evaluating the integrand, the level set function, and the gradient of the level set function at given points. It can achieve arbitrarily high accuracy by increasing the orders of Gaussian quadratures, and it does not need extra a priori knowledge about the integrand and the level set function. The code for the algorithm is freely available in the open-source finite element toolbox Parallel Hierarchical Grid (PHG) and can serve as a basic building block for implementing 3D high-order numerical algorithms involving implicit interfaces or boundaries.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"32 1","pages":"1 - 18"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86541787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Jónasson, S. Sigurdsson, H. F. Yngvason, Pétur Orri Ragnarsson, Páll Melsted
A set of Fortran subroutines for reverse mode algorithmic (or automatic) differentiation of the basic linear algebra subprograms (BLAS) is presented. This is preceded by a description of the mathematical tools used to obtain the formulae of these derivatives, with emphasis on special matrices supported by the BLAS: triangular, symmetric, and band. All single and double precision BLAS derivatives have been implemented, together with the Cholesky factorization from Linear Algebra Package (LAPACK). The subroutines are written in Fortran 2003 with a Fortran 77 interface to allow use from C and C++, as well as dynamic languages such as R, Python, Matlab, and Octave. The subroutines are all implemented by calling BLAS, thereby attaining fast runtime. Timing results show derivative runtimes that are about twice those of the corresponding BLAS, in line with theory. The emphasis is on reverse mode because it is more important for the main application that we have in mind, numerical optimization. Two examples are presented, one dealing with the least squares modeling of groundwater, and the other dealing with the maximum likelihood estimation of the parameters of a vector autoregressive time series. The article contains comprehensive tables of formulae for the BLAS derivatives as well as for several non-BLAS matrix operations commonly used in optimization.
{"title":"Algorithm 1005","authors":"K. Jónasson, S. Sigurdsson, H. F. Yngvason, Pétur Orri Ragnarsson, Páll Melsted","doi":"10.1145/3382191","DOIUrl":"https://doi.org/10.1145/3382191","url":null,"abstract":"A set of Fortran subroutines for reverse mode algorithmic (or automatic) differentiation of the basic linear algebra subprograms (BLAS) is presented. This is preceded by a description of the mathematical tools used to obtain the formulae of these derivatives, with emphasis on special matrices supported by the BLAS: triangular, symmetric, and band. All single and double precision BLAS derivatives have been implemented, together with the Cholesky factorization from Linear Algebra Package (LAPACK). The subroutines are written in Fortran 2003 with a Fortran 77 interface to allow use from C and C++, as well as dynamic languages such as R, Python, Matlab, and Octave. The subroutines are all implemented by calling BLAS, thereby attaining fast runtime. Timing results show derivative runtimes that are about twice those of the corresponding BLAS, in line with theory. The emphasis is on reverse mode because it is more important for the main application that we have in mind, numerical optimization. Two examples are presented, one dealing with the least squares modeling of groundwater, and the other dealing with the maximum likelihood estimation of the parameters of a vector autoregressive time series. The article contains comprehensive tables of formulae for the BLAS derivatives as well as for several non-BLAS matrix operations commonly used in optimization.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"33 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78416019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carmen Arévalo, Erik Jonsson-Glans, Josefine Olander, M. S. Soto, Gustaf Söderlind
We present a software package, Modes, offering h-adaptive and p-adaptive linear multistep methods for first order initial value problems in ordinary differential equations. The implementation is based on a new parametric, grid-independent representation of multistep methods [Arévalo and Söderlind 2017]. Parameters are supplied for over 60 methods. For nonstiff problems, all maximal order methods (p=k for explicit and p=k+1 for implicit methods) are supported. For stiff computation, implicit methods of order p=k are included. A collection of step-size controllers based on digital filters is provided, generating smooth step-size sequences offering improved computational stability. Controllers may be selected to match method and problem classes. A new system for automatic order control is also provided for designated families of multistep methods, offering simultaneous h- and p-adaptivity. Implemented as a Matlab toolbox, the software covers high order computations with linear multistep methods within a unified, generic framework. Computational experiments show that the new software is competitive and offers qualitative improvements. Modes is available for downloading and is primarily intended as a platform for developing a new generation of state-of-the-art multistep solvers, as well as for true ceteris paribus evaluation of algorithmic components. This also enables method comparisons within a single implementation environment.
我们提出了一个软件包,Modes,提供了常微分方程一阶初值问题的h-自适应和p-自适应线性多步方法。该实现基于多步骤方法的一种新的参数化、网格无关的表示[arsamuvalo and Söderlind 2017]。为60多个方法提供了参数。对于非刚性问题,支持所有最大阶方法(显式方法为p=k,隐式方法为p=k+1)。对于刚性计算,包括p=k阶的隐式方法。提供了一组基于数字滤波器的步长控制器,生成平滑的步长序列,提供了更好的计算稳定性。可以选择控制器来匹配方法和问题类别。自动订单控制的新系统也提供了指定的多步骤方法家族,同时提供h-和p-适应性。作为Matlab工具箱实现,该软件在统一的通用框架内涵盖了线性多步骤方法的高阶计算。计算实验表明,新软件具有竞争力,并提供了定性改进。Modes可供下载,主要用作开发新一代最先进的多步骤求解器的平台,以及对算法组件进行真正的同等条件评估。这还支持在单个实现环境中进行方法比较。
{"title":"A Software Platform for Adaptive High Order Multistep Methods","authors":"Carmen Arévalo, Erik Jonsson-Glans, Josefine Olander, M. S. Soto, Gustaf Söderlind","doi":"10.1145/3372159","DOIUrl":"https://doi.org/10.1145/3372159","url":null,"abstract":"We present a software package, Modes, offering h-adaptive and p-adaptive linear multistep methods for first order initial value problems in ordinary differential equations. The implementation is based on a new parametric, grid-independent representation of multistep methods [Arévalo and Söderlind 2017]. Parameters are supplied for over 60 methods. For nonstiff problems, all maximal order methods (p=k for explicit and p=k+1 for implicit methods) are supported. For stiff computation, implicit methods of order p=k are included. A collection of step-size controllers based on digital filters is provided, generating smooth step-size sequences offering improved computational stability. Controllers may be selected to match method and problem classes. A new system for automatic order control is also provided for designated families of multistep methods, offering simultaneous h- and p-adaptivity. Implemented as a Matlab toolbox, the software covers high order computations with linear multistep methods within a unified, generic framework. Computational experiments show that the new software is competitive and offers qualitative improvements. Modes is available for downloading and is primarily intended as a platform for developing a new generation of state-of-the-art multistep solvers, as well as for true ceteris paribus evaluation of algorithmic components. This also enables method comparisons within a single implementation environment.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"81 1","pages":"1 - 17"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76709535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterated-integral signatures and log signatures are sequences calculated from a path that characterizes its shape. They originate from the work of K. T. Chen and have become important through Terry Lyons’s theory of differential equations driven by rough paths, which is an important developing area of stochastic analysis. They have applications in statistics and machine learning, where there can be a need to calculate finite parts of them quickly for many paths. We introduce the signature and the most basic information (displacement and signed areas) that it contains. We present algorithms for efficiently calculating these signatures. For log signatures this requires consideration of the structure of free Lie algebras. We benchmark the performance of the algorithms. The methods are implemented in C++ and released as a Python extension package, which also supports differentiation. In combination with a machine learning library (Tensorflow, PyTorch, or Theano), this allows end-to-end learning of neural networks involving signatures.
迭代积分签名和日志签名是从表征其形状的路径计算的序列。它们起源于K. T. Chen的工作,并通过Terry Lyons的粗糙路径驱动微分方程理论而变得重要,这是随机分析的一个重要发展领域。它们在统计学和机器学习中有应用,在这些领域,可能需要对许多路径快速计算它们的有限部分。我们介绍了签名及其包含的最基本信息(位移和签名区域)。我们提出了有效计算这些签名的算法。对于日志签名,这需要考虑自由李代数的结构。我们对算法的性能进行基准测试。这些方法在c++中实现,并作为Python扩展包发布,该扩展包也支持差异化。结合机器学习库(Tensorflow, PyTorch或Theano),这允许涉及签名的神经网络的端到端学习。
{"title":"Algorithm 1004","authors":"Jeremy Reizenstein, Benjamin Graham","doi":"10.1145/3371237","DOIUrl":"https://doi.org/10.1145/3371237","url":null,"abstract":"Iterated-integral signatures and log signatures are sequences calculated from a path that characterizes its shape. They originate from the work of K. T. Chen and have become important through Terry Lyons’s theory of differential equations driven by rough paths, which is an important developing area of stochastic analysis. They have applications in statistics and machine learning, where there can be a need to calculate finite parts of them quickly for many paths. We introduce the signature and the most basic information (displacement and signed areas) that it contains. We present algorithms for efficiently calculating these signatures. For log signatures this requires consideration of the structure of free Lie algebras. We benchmark the performance of the algorithms. The methods are implemented in C++ and released as a Python extension package, which also supports differentiation. In combination with a machine learning library (Tensorflow, PyTorch, or Theano), this allows end-to-end learning of neural networks involving signatures.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"26 1","pages":"1 - 21"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74320556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional Graphics Processing Unit (GPU) implementations of Strassen’s algorithm (Strassen) rely on the existing high-performance matrix multiplication (gemm), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, “squarish” matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from gemm, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for gemm and Strassen. Overall, our 1-level Strassen can achieve up to 1.11× speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19× speedup with a crossover point at 7,680.
传统图形处理单元(GPU)实现的Strassen算法(Strassen)依赖于现有的高性能矩阵乘法(gem),以空间换取时间。因此,由于额外的内存开销,这种方法只能实现相对较大的“平方”矩阵的实际加速,并且由于相当大的工作空间,它们的使用受到限制。我们提出了新的Strassen原语的gpu,可以组成生成一个家族的Strassen算法。我们的算法利用gpu上的内存和线程层次结构,重用从gem继承的共享内存和注册文件,融合额外的操作,并避免额外的工作空间。我们通过批处理、流处理和原子操作进一步利用内核内部和内核间的并行性。我们建立了NVIDIA Volta gpu的性能模型,以选择合适的阻塞参数并预测gem和Strassen的性能。总的来说,与NVIDIA Tesla V100 GPU上的cublassgem相比,我们的1级Strassen可以实现高达1.11倍的加速,交叉点小至1536。有了额外的工作空间,我们的2级Strassen可以实现1.19倍的加速,交叉点为7680。
{"title":"Strassen’s Algorithm Reloaded on GPUs","authors":"Jianyu Huang, Chenhan D. Yu, R. Geijn","doi":"10.1145/3372419","DOIUrl":"https://doi.org/10.1145/3372419","url":null,"abstract":"Conventional Graphics Processing Unit (GPU) implementations of Strassen’s algorithm (Strassen) rely on the existing high-performance matrix multiplication (gemm), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, “squarish” matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from gemm, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for gemm and Strassen. Overall, our 1-level Strassen can achieve up to 1.11× speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19× speedup with a crossover point at 7,680.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"54 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2020-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74143251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a computational procedure to evaluate the integral ∫yx sp-1 e-μs ds for 0 ≤ x < y ≤ +∞,μ = ±1, p> 0, which generalizes the lower (x=0) and upper (y=+∞) incomplete gamma functions. To allow for large values of x, y, and p while avoiding under/overflow issues in the standard double precision floating point arithmetic, we use an explicit normalization that is much more efficient than the classical ratio with the complete gamma function. The generalized incomplete gamma function is estimated with continued fractions, with integrations by parts, or, when x ≈ y, with the Romberg numerical integration algorithm. We show that the accuracy reached by our algorithm improves a recent state-of-the-art method by two orders of magnitude, and it is essentially optimal considering the limitations imposed by floating point arithmetic. Moreover, the admissible parameter range of our algorithm (0 ≤ p,x,y ≤ 1015) is much larger than competing algorithms, and its robustness is assessed through massive usage in an image processing application.
{"title":"Algorithm 1006","authors":"Rémy Abergel, L. Moisan","doi":"10.1145/3365983","DOIUrl":"https://doi.org/10.1145/3365983","url":null,"abstract":"We present a computational procedure to evaluate the integral ∫yx sp-1 e-μs ds for 0 ≤ x < y ≤ +∞,μ = ±1, p> 0, which generalizes the lower (x=0) and upper (y=+∞) incomplete gamma functions. To allow for large values of x, y, and p while avoiding under/overflow issues in the standard double precision floating point arithmetic, we use an explicit normalization that is much more efficient than the classical ratio with the complete gamma function. The generalized incomplete gamma function is estimated with continued fractions, with integrations by parts, or, when x ≈ y, with the Romberg numerical integration algorithm. We show that the accuracy reached by our algorithm improves a recent state-of-the-art method by two orders of magnitude, and it is essentially optimal considering the limitations imposed by floating point arithmetic. Moreover, the admissible parameter range of our algorithm (0 ≤ p,x,y ≤ 1015) is much larger than competing algorithms, and its robustness is assessed through massive usage in an image processing application.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"18 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2020-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90145877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}