Jakub Klinkovský, T. Oberhuber, R. Fučík, Vítezslav Zabka
A general multi-purpose data structure for the efficient representation of conforming unstructured homogeneous meshes for scientific computations on CPU- and GPU-based systems is presented. The data structure is provided as open-source software as part of the TNL library (https://tnl-project.org/). The abstract representation supports almost any cell shape; the common 2D quadrilateral, 3D hexahedron, and arbitrarily dimensional simplex shapes are currently built into the library. The implementation is highly configurable via C++ templates, which makes it possible to avoid storing unnecessary dynamic data. The internal memory layout is based on state-of-the-art sparse-matrix storage formats, which are optimized for different hardware architectures in order to provide high-performance computations. The proposed data structure is also suitable for meshes decomposed into several subdomains and for distributed computing using the Message Passing Interface (MPI). The efficiency of the implemented data structure on CPU and GPU hardware architectures is demonstrated on several benchmark problems and in a comparison with another library. Its applicability to advanced numerical methods is demonstrated on an example problem of two-phase flow in porous media using a numerical scheme based on the mixed-hybrid finite element method (MHFEM). We show GPU speed-ups above 20 in 2D and 50 in 3D when compared to sequential CPU computations, and above 2 in 2D and 9 in 3D when compared to 12-threaded CPU computations.
{"title":"Configurable Open-source Data Structure for Distributed Conforming Unstructured Homogeneous Meshes with GPU Support","authors":"Jakub Klinkovský, T. Oberhuber, R. Fučík, Vítezslav Zabka","doi":"10.1145/3536164","DOIUrl":"https://doi.org/10.1145/3536164","url":null,"abstract":"A general multi-purpose data structure for an efficient representation of conforming unstructured homogeneous meshes for scientific computations on CPU and GPU-based systems is presented. The data structure is provided as open-source software as part of the TNL library (https://tnl-project.org/). The abstract representation supports almost any cell shape and common 2D quadrilateral, 3D hexahedron and arbitrarily dimensional simplex shapes are currently built into the library. The implementation is highly configurable via templates of the C++ language, which allows avoiding the storage of unnecessary dynamic data. The internal memory layout is based on state-of-the-art sparse matrix storage formats, which are optimized for different hardware architectures in order to provide high-performance computations. The proposed data structure is also suitable for meshes decomposed into several subdomains and distributed computing using the Message Passing Interface (MPI). The efficiency of the implemented data structure on CPU and GPU hardware architectures is demonstrated on several benchmark problems and a comparison with another library. Its applicability to advanced numerical methods is demonstrated with an example problem of two-phase flow in porous media using a numerical scheme based on the mixed-hybrid finite element method (MHFEM). We show GPU speed-ups that rise above 20 in 2D and 50 in 3D when compared to sequential CPU computations, and above 2 in 2D and 9 in 3D when compared to 12-threaded CPU computations.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"10 1","pages":"1 - 30"},"PeriodicalIF":0.0,"publicationDate":"2022-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86142597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Audet, Sébastien Le Digabel, Viviane Rochon Montplaisir, C. Tribes
NOMAD is a state-of-the-art software package for optimizing blackbox problems. In continuous development since 2001, it has constantly evolved with the integration of new algorithmic features published in scientific publications. These features are motivated by real applications encountered by industrial partners. The latest major release of NOMAD, version 3, dates back to 2008; minor releases are produced as new features are incorporated. The present work describes NOMAD 4, a complete redesign of the previous version, with a new architecture that provides more flexible and reusable code and added functionalities. We introduce algorithmic components, which are building blocks for more complex algorithms and can initiate other components, launch nested algorithms, or perform specialized tasks. They facilitate the implementation of new ideas, including the MegaSearchPoll component, warm and hot restarts, and a revised version of the PsdMads algorithm. Another main improvement of NOMAD 4 is its use of parallelism to compute multiple blackbox evaluations simultaneously and to maximize the usage of available cores. Running different algorithms, tuning their parameters, and comparing their performance are simpler than before, while overall optimization performance is maintained between versions 3 and 4. NOMAD is freely available at www.gerad.ca/nomad, and the whole project is visible at github.com/bbopt/nomad.
{"title":"Algorithm 1027: NOMAD Version 4: Nonlinear Optimization with the MADS Algorithm","authors":"C. Audet, Sébastien Le Digabel, Viviane Rochon Montplaisir, C. Tribes","doi":"10.1145/3544489","DOIUrl":"https://doi.org/10.1145/3544489","url":null,"abstract":"NOMADis a state-of-the-art software package for optimizing blackbox problems. In continuous development since 2001, it constantly evolved with the integration of new algorithmic features published in scientific publications. These features are motivated by real applications encountered by industrial partners. The latest major release of NOMAD, version 3, dates to 2008. Minor releases are produced as new features are incorporated. The present work describes NOMAD 4, a complete redesign of the previous version, with a new architecture providing more flexible code, added functionalities, and reusable code. We introduce algorithmic components, which are building blocks for more complex algorithms and can initiate other components, launch nested algorithms, or perform specialized tasks. They facilitate the implementation of new ideas, including the MegaSearchPoll component, warm and hot restarts, and a revised version of the PsdMads algorithm. Another main improvement of NOMAD 4 is the usage of parallelism, to simultaneously compute multiple blackbox evaluations and to maximize usage of available cores. Running different algorithms, tuning their parameters, and comparing their performance for optimization are simpler than before, while overall optimization performance is maintained between versions 3 and 4. NOMAD is freely available at www.gerad.ca/nomad and the whole project is visible at github.com/bbopt/nomad.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"11 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80405903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a new accurate summation algorithm based on error-free summation into floating-point buckets. Our algorithm exploits ideas from Zhu and Hayes’ OnlineExactSum, but it uses a significantly smaller number of accumulators and has better instruction-level parallelism. In the default setting, our implementation aaaSum returns a faithfully rounded floating-point approximation of the true sum. We also discuss possible modifications for the computation of reproducible, correctly rounded, and multiple-precision floating-point approximations. The computational overhead of any of these modifications is kept comparably small. Numerical tests demonstrate that aaaSum performs well for very small to large problem sizes, independently of the condition number of the problem. We compare our algorithm with other accurate and high-precision summation approaches.
{"title":"Toward Accurate and Fast Summation","authors":"M. Lange","doi":"10.1145/3544488","DOIUrl":"https://doi.org/10.1145/3544488","url":null,"abstract":"We introduce a new accurate summation algorithm based on the error-free summation into floating-point buckets. Our algorithm exploits ideas from Zhu and Hayes’ OnlineExactSum, but it uses a significantly smaller number of accumulators and has a better instruction-level parallelism. In the default setting, our implementation aaaSum returns a faithfully rounded floating-point approximation of the true sum. We also discuss possible modifications for the computation of reproducible, correctly rounded, and multiple precision floating-point approximations. The computational overhead for any of these modifications is kept comparably small. Numerical tests demonstrate that aaaSum performs well for very small to large problem sizes, independent of the condition number of the problem. We compare our algorithm with other accurate and high-precision summation approaches.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"44 1","pages":"1 - 39"},"PeriodicalIF":0.0,"publicationDate":"2022-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85931115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tyler H. Chang, L.T. Watson, Jeffrey Larson, N. Neveu, W. Thacker, Shubhangi G. Deshpande, T. Lux
VTMOP is a Fortran 2008 software package containing two Fortran modules for solving computationally expensive bound-constrained blackbox multiobjective optimization problems. VTMOP implements the algorithm of [32], which handles two or more objectives, does not require any derivatives, and produces well-distributed points over the Pareto front. The first module contains a general framework for solving multiobjective optimization problems by combining response surface methodology, trust region methodology, and an adaptive weighting scheme. The second module features a driver subroutine that implements this framework when the objective functions can be wrapped as a Fortran subroutine. Support is provided for both serial and parallel execution paradigms, and VTMOP is demonstrated on several test problems as well as one real-world problem in the area of particle accelerator optimization.
{"title":"Algorithm 1028: VTMOP: Solver for Blackbox Multiobjective Optimization Problems","authors":"Tyler H. Chang, L.T. Watson, Jeffrey Larson, N. Neveu, W. Thacker, Shubhangi G. Deshpande, T. Lux","doi":"10.1145/3529258","DOIUrl":"https://doi.org/10.1145/3529258","url":null,"abstract":"VTMOP is a Fortran 2008 software package containing two Fortran modules for solving computationally expensive bound-constrained blackbox multiobjective optimization problems. VTMOP implements the algorithm of [32], which handles two or more objectives, does not require any derivatives, and produces well-distributed points over the Pareto front. The first module contains a general framework for solving multiobjective optimization problems by combining response surface methodology, trust region methodology, and an adaptive weighting scheme. The second module features a driver subroutine that implements this framework when the objective functions can be wrapped as a Fortran subroutine. Support is provided for both serial and parallel execution paradigms, and VTMOP is demonstrated on several test problems as well as one real-world problem in the area of particle accelerator optimization.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"27 1","pages":"1 - 34"},"PeriodicalIF":0.0,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82621587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of 𝒪(mn), while the tiled BLR-QR exhibits 𝒪(mn^1.5) complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram–Schmidt (MGS) BLR-QR using fork-join parallelism and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k × 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show speedups of 50 and 37 times, respectively.
{"title":"Parallel QR Factorization of Block Low-rank Matrices","authors":"M. R. Apriansyah, Rio Yokota","doi":"10.1145/3538647","DOIUrl":"https://doi.org/10.1145/3538647","url":null,"abstract":"We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of 𝒪(mn), while the tiled BLR-QR exhibits 𝒪(mn1.5 complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram–Schmidt (MGS) BLR-QR using fork-join parallelism and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k × 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"275 1","pages":"1 - 28"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85030943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
{"title":"On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs","authors":"James D. Trotter, Xing Cai, S. Funke","doi":"10.1145/3503925","DOIUrl":"https://doi.org/10.1145/3503925","url":null,"abstract":"Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"89 1","pages":"1 - 31"},"PeriodicalIF":0.0,"publicationDate":"2022-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73638974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christopher Lourenco, Jinhao Chen, Erick Moreno-Centeno, T. Davis
SPEX Left LU is a software package for exactly solving unsymmetric sparse linear systems. As a component of the sparse exact (SPEX) software package, SPEX Left LU can be applied to any input matrix, A, whose entries are integral, rational, or decimal, and provides a solution to the system Ax = b, which is either exact or accurate to user-specified precision. SPEX Left LU preorders the matrix A with a user-specified fill-reducing ordering and computes a left-looking LU factorization with the special property that each operation used to compute the L and U matrices is integral. Notable additional applications of this package include benchmarking the stability and accuracy of state-of-the-art linear solvers and determining whether singular-to-double-precision matrices are indeed singular. Computationally, this article evaluates the impact of several novel pivoting schemes in exact arithmetic, benchmarks the exact iterative solvers within Linbox, and benchmarks the accuracy of MATLAB sparse backslash. Most importantly, it is shown that SPEX Left LU outperforms the exact iterative solvers in run time on easy instances and in stability as the iterative solver fails on a sizeable subset of the tested (both easy and hard) instances. The SPEX Left LU package is written in ANSI C, comes with a MATLAB interface, and is distributed via GitHub, as a component of the SPEX software package, and as a component of SuiteSparse.
{"title":"Algorithm 1021: SPEX Left LU, Exactly Solving Sparse Linear Systems via a Sparse Left-looking Integer-preserving LU Factorization","authors":"Christopher Lourenco, Jinhao Chen, Erick Moreno-Centeno, T. Davis","doi":"10.1145/3519024","DOIUrl":"https://doi.org/10.1145/3519024","url":null,"abstract":"SPEX Left LU is a software package for exactly solving unsymmetric sparse linear systems. As a component of the sparse exact (SPEX) software package, SPEX Left LU can be applied to any input matrix, A, whose entries are integral, rational, or decimal, and provides a solution to the system ( Ax = b ) , which is either exact or accurate to user-specified precision. SPEX Left LU preorders the matrix A with a user-specified fill-reducing ordering and computes a left-looking LU factorization with the special property that each operation used to compute the L and U matrices is integral. Notable additional applications of this package include benchmarking the stability and accuracy of state-of-the-art linear solvers and determining whether singular-to-double-precision matrices are indeed singular. Computationally, this article evaluates the impact of several novel pivoting schemes in exact arithmetic, benchmarks the exact iterative solvers within Linbox, and benchmarks the accuracy of MATLAB sparse backslash. Most importantly, it is shown that SPEX Left LU outperforms the exact iterative solvers in run time on easy instances and in stability as the iterative solver fails on a sizeable subset of the tested (both easy and hard) instances. The SPEX Left LU package is written in ANSI C, comes with a MATLAB interface, and is distributed via GitHub, as a component of the SPEX software package, and as a component of SuiteSparse.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"14 1","pages":"1 - 23"},"PeriodicalIF":0.0,"publicationDate":"2022-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82138007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivated by the unexpected failure of the triangle intersection component of the Projection Algorithm for Nonmatching Grids (PANG), this article provides a robust version with proof of backward stability. The new triangle intersection algorithm ensures consistency and parsimony across three types of calculations. The set of intersections produced by the algorithm, called representations, is shown to match the set of geometric intersections, called models. The article concludes with a comparison between the old and new intersection algorithms for PANG using an example found to reliably generate failures in the former.
{"title":"A Provably Robust Algorithm for Triangle-triangle Intersections in Floating-point Arithmetic","authors":"Conor Mccoid, M. Gander","doi":"10.1145/3513264","DOIUrl":"https://doi.org/10.1145/3513264","url":null,"abstract":"Motivated by the unexpected failure of the triangle intersection component of the Projection Algorithm for Nonmatching Grids (PANG), this article provides a robust version with proof of backward stability. The new triangle intersection algorithm ensures consistency and parsimony across three types of calculations. The set of intersections produced by the algorithm, called representations, is shown to match the set of geometric intersections, called models. The article concludes with a comparison between the old and new intersection algorithms for PANG using an example found to reliably generate failures in the former.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"98 1","pages":"1 - 30"},"PeriodicalIF":0.0,"publicationDate":"2022-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80549268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A structured version of derivative-free random pattern search optimization algorithms is introduced, which is able to exploit coordinate partially separable structure (typically associated with sparsity) often present in unconstrained and bound-constrained optimization problems. This technique improves performance by orders of magnitude and makes it possible to solve large problems that otherwise are totally intractable by other derivative-free methods. A library of interpolation-based modelling tools is also described, which can be associated with the structured or unstructured versions of the initial pattern search algorithm. The use of the library further enhances performance, especially when associated with structure. The significant gains in performance associated with these two techniques are illustrated using a new freely available release of the Brute Force Optimizer (BFO) package, first introduced in [Porcelli and Toint 2017], which incorporates them. An interesting conclusion of the numerical results presented is that providing global structural information on a problem can result in significantly fewer evaluations of the objective function than attempting to build local Taylor-like models.
{"title":"Exploiting Problem Structure in Derivative Free Optimization","authors":"M. Porcelli, P. Toint","doi":"10.1145/3474054","DOIUrl":"https://doi.org/10.1145/3474054","url":null,"abstract":"A structured version of derivative-free random pattern search optimization algorithms is introduced, which is able to exploit coordinate partially separable structure (typically associated with sparsity) often present in unconstrained and bound-constrained optimization problems. This technique improves performance by orders of magnitude and makes it possible to solve large problems that otherwise are totally intractable by other derivative-free methods. A library of interpolation-based modelling tools is also described, which can be associated with the structured or unstructured versions of the initial pattern search algorithm. The use of the library further enhances performance, especially when associated with structure. The significant gains in performance associated with these two techniques are illustrated using a new freely-available release of the Brute Force Optimizer (BFO) package firstly introduced in [Porcelli and Toint 2017], which incorporates them. An interesting conclusion of the numerical results presented is that providing global structural information on a problem can result in significantly less evaluations of the objective function than attempting to building local Taylor-like models.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"128 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2022-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77698048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The article titled “Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing” by Anzt et al. presents a modern, linear-operator-centric C++ library for sparse linear algebra. Experimental results in the article demonstrate that Ginkgo is a flexible and user-friendly framework capable of achieving high performance on state-of-the-art GPU architectures. In this report, the Ginkgo library is installed and a subset of the experimental results is reproduced. Specifically, the experiment that shows the achieved memory bandwidth of the Ginkgo Krylov linear solvers on NVIDIA A100 and AMD MI100 GPUs is redone, and the results are compared to those presented in the published article. Upon completion of the comparison, the published results are deemed reproducible.
{"title":"Reproduced Computational Results Report for “Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing”","authors":"C. Balos","doi":"10.1145/3480936","DOIUrl":"https://doi.org/10.1145/3480936","url":null,"abstract":"The article titled “Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing” by Anzt et al. presents a modern, linear operator centric, C++ library for sparse linear algebra. Experimental results in the article demonstrate that Ginkgo is a flexible and user-friendly framework capable of achieving high-performance on state-of-the-art GPU architectures. In this report, the Ginkgo library is installed and a subset of the experimental results are reproduced. Specifically, the experiment that shows the achieved memory bandwidth of the Ginkgo Krylov linear solvers on NVIDIA A100 and AMD MI100 GPUs is redone and the results are compared to what presented in the published article. Upon completion of the comparison, the published results are deemed reproducible.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"149 1","pages":"1 - 7"},"PeriodicalIF":0.0,"publicationDate":"2022-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85367126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}