
Latest publications in ACM Transactions on Mathematical Software (TOMS)

Tree Partitioning Reduction
Pub Date : 2019-08-08 DOI: 10.1145/3328731
A. P. Diéguez, M. Amor, R. Doallo
Solving tridiagonal systems of linear equations is a fundamental computing kernel in a wide range of scientific and engineering applications, and its computation can be modeled with parallel algorithms. These parallel solvers are typically designed for problems whose data fit in a common shared-memory space to which all the cores taking part in the computation have access. However, when the problem is large, the data cannot be stored entirely in this common shared-memory space, and a large number of high-latency communications must be performed. One alternative is to partition the problem among different memory spaces. In that setting, conventional parallel algorithms do not lend themselves to partitioning the computation into independent tiles, since each reduction depends on equations that may lie in different tiles. This article proposes an algorithm based on a tree reduction, called the Tree Partitioning Reduction (TPR) method, which partitions the problem into independent slices that can be partially computed in parallel within different common shared-memory spaces. The TPR method can be implemented for any parallel and distributed programming paradigm. Furthermore, in this work, TPR is efficiently implemented on CUDA GPUs to solve large problems, providing highly competitive performance with respect to existing packages: on average, it is 22.03× faster than cuSPARSE.
Citations: 2
Algorithm 998
Pub Date : 2019-08-08 DOI: 10.1145/3323925
C. Agulhari, Alexandre Felipe, R. Oliveira, P. Peres
The ROLMIP (Robust LMI Parser) is a toolbox specialized in control theory for uncertain linear systems, built to work under MATLAB jointly with YALMIP, to ease the programming of sufficient Linear Matrix Inequality (LMI) conditions that, if feasible, assure the validity of parameter-dependent LMIs over the entire uncertainty set considered. This article presents the new version of the ROLMIP toolbox, which was completely remodeled to provide a high-level, user-friendly interface that copes with distinct uncertainty domains (hypercube and multi-simplex) and treats time-varying parameters in discrete and continuous time. By means of simple commands, the user can define polynomial matrices and describe the desired parameter-dependent LMIs in an easy way, considerably reducing the programming time needed to arrive at implementable LMI conditions. ROLMIP therefore helps popularize state-of-the-art LMI-based robust control methods for uncertain systems among graduate students, researchers, and control engineers.
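ROLMIP itself is a MATLAB/YALMIP toolbox, but the vertex-wise reasoning behind polytopic LMI conditions can be sketched in a few lines of Python. The snippet below (our own illustrative helper, not part of ROLMIP) checks a candidate Lyapunov matrix P against every vertex of an uncertainty polytope; by convexity, vertex feasibility certifies the parameter-dependent LMI on the whole simplex.

```python
import numpy as np

def check_quadratic_stability(vertices, P, tol=1e-9):
    """Given polytope vertices A_i and a candidate Lyapunov matrix P,
    check P > 0 and A_i^T P + P A_i < 0 at every vertex. By convexity,
    this certifies A(alpha)^T P + P A(alpha) < 0 over the whole simplex."""
    if np.min(np.linalg.eigvalsh(P)) <= tol:
        return False
    for A in vertices:
        M = A.T @ P + P @ A
        if np.max(np.linalg.eigvalsh(M)) >= -tol:
            return False
    return True
```

Tools like ROLMIP automate the harder direction: generating such vertex-wise (and polynomially parameter-dependent) conditions for an LMI solver to search for P.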
Citations: 48
Algorithm 997
Pub Date : 2019-08-08 DOI: 10.1145/3310410
R. Speck
In this article, we present the Python framework pySDC for solving collocation problems with spectral deferred correction (SDC) methods and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit timestepping to high-order implicit-explicit or multi-implicit splitting and multilevel SDC. The software package comes with many different preimplemented examples and seven tutorials to help new users take their first steps. Time parallelism is implemented either in an emulated way, for debugging and prototyping, or using MPI, for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code from two different perspectives: that of the user and that of the developer. The first sheds light on the front end, the examples, and the tutorials; the second describes the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities, and the usage of pySDC. In addition, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.
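To make the "collocation problem" concrete, here is a minimal numpy sketch (not pySDC's API) for the Dahlquist test equation y' = λy: build the spectral quadrature matrix Q, solve the collocation system directly, and run the plain fixed-point (Picard) iteration that SDC accelerates with preconditioned sweeps. Node placement and step size are illustrative choices.

```python
import numpy as np

def quadrature_matrix(nodes):
    """Q[m, j] = integral of the j-th Lagrange basis polynomial from 0 to nodes[m].
    The collocation problem for y' = f(y) on one unit step reads U = u0 + dt*Q f(U)."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        others = np.delete(nodes, j)
        coeffs = np.poly(others) / np.prod(nodes[j] - others)  # Lagrange l_j
        anti = np.polyint(coeffs)
        Q[:, j] = np.polyval(anti, nodes) - np.polyval(anti, 0.0)
    return Q

# Dahlquist test problem y' = lam*y, y(0) = 1, one step of size dt
lam, dt, u0 = -1.0, 0.1, 1.0
nodes = np.array([1 / 3, 2 / 3, 1.0])
Q = quadrature_matrix(nodes)

# Direct solve of the linear collocation system (the solution SDC converges to)
U_direct = np.linalg.solve(np.eye(3) - dt * lam * Q, np.full(3, u0))

# Plain fixed-point (Picard) iteration; SDC replaces this with
# low-order preconditioned sweeps that converge much faster per sweep cost
U = np.full(3, u0)
for _ in range(30):
    U = u0 + dt * lam * (Q @ U)
```

The value at the last node approximates exp(λ·dt) to high order, and the iteration converges to the direct solve for small |λ·dt|.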
Citations: 5
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
Pub Date : 2019-08-04 DOI: 10.1145/3466795
Carl Yang, A. Buluç, John Douglas Owens
Graph algorithms are challenging to implement efficiently on new parallel hardware such as GPUs for three reasons: (1) the difficulty of devising graph building blocks, (2) load imbalance on parallel hardware, and (3) the low arithmetic intensity of graph problems. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push or pull direction. Exploiting output sparsity allows users to tell the backend which values of the output of a single vectorized computation they do not want computed. Load balancing is an important feature for distributing work among parallel workers, and we describe the load-balancing features needed to handle graphs with different characteristics. The design principles described in this paper have been implemented in GraphBLAST, the first open-source high-performance linear-algebra-based graph framework on NVIDIA GPUs. The results show that on a single GPU, GraphBLAST achieves on average at least an order-of-magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, performance comparable to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
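The push-style, mask-exploiting pattern these frameworks build on can be sketched with a scipy-based BFS: each level is one sparse matrix-vector product, and a complement mask over already-visited vertices is exactly the "output sparsity" the abstract describes. This is a CPU illustration of the idea, not GraphBLAST code.

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A, source):
    """Level-synchronous BFS as repeated sparse matrix-vector products with a
    complement mask. A is an n x n CSR adjacency matrix (A[i, j] = 1 for edge i -> j)."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n, dtype=np.int8)
    frontier[source] = 1
    depth = 0
    while frontier.any():
        levels[frontier != 0] = depth
        pushed = A.T.dot(frontier)  # "push" along out-edges: one SpMV per level
        # mask: keep only vertices not yet assigned a level (output sparsity)
        frontier = ((pushed != 0) & (levels == -1)).astype(np.int8)
        depth += 1
    return levels
```

A GraphBLAS backend fuses the SpMV and the mask so that masked-out outputs are never computed at all, instead of being computed and discarded as here.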
Citations: 62
Adjoint Code Design Patterns
Pub Date : 2019-07-30 DOI: 10.1145/3326162
U. Naumann
Adjoint methods have become fundamental ingredients of the scientific computing toolbox over the past decades. Large-scale parameter sensitivity analysis, uncertainty quantification, and nonlinear optimization would otherwise be computationally infeasible. The symbolic derivation of adjoint mathematical models for relevant problems in science and engineering, and their implementation consistently with the implementation of the underlying primal model, frequently proves highly challenging. Hence, an increased interest in algorithmic adjoints can be observed. The algorithmic derivation of adjoint numerical simulation programs shifts some of the problems faced from functional and numerical analysis to computer science. It becomes a highly complex software engineering task requiring expertise in software analysis, transformation, and optimization. Despite rather mature software tool support for algorithmic differentiation, substantial user intervention is typically required when targeting nontrivial numerical programs. A large number of patterns shared by numerous application codes results in repeated duplication of development effort. The adjoint code design patterns introduced in this article aim to reduce this problem through improved formalization from the software engineering perspective. Fully functional reference implementations are provided on GitHub.
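The core mechanism of an algorithmic adjoint, recording local partial derivatives during the primal sweep and propagating adjoints in reverse, fits in a short sketch. This toy reverse-mode class is ours (the article's patterns target far richer C++-style codes), but it shows the tape-and-reverse-sweep idea.

```python
import math

class Var:
    """Minimal reverse-mode AD: each operation records (parent, local_partial)
    pairs; backward() sweeps the computation graph in reverse topological
    order, accumulating adjoints -- the 'algorithmic adjoint' idea."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def sin(self):
        return Var(math.sin(self.value), [(self, math.cos(self.value))])

    def backward(self):
        # build a topological order of the recorded graph, then sweep it backwards
        order, seen = [], set()
        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
        visit(self)
        self.adjoint = 1.0
        for v in reversed(order):
            for p, dp in v.parents:
                p.adjoint += dp * v.adjoint
```

One backward sweep yields all input sensitivities at a cost proportional to a single function evaluation, which is why adjoints dominate large-scale sensitivity analysis.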
Citations: 2
Enclosing Chebyshev Expansions in Linear Time
Pub Date : 2019-07-30 DOI: 10.1145/3319395
B. Hashemi
We consider the problem of computing rigorous enclosures for polynomials represented in the Chebyshev basis. Our aim is to compare and develop algorithms with linear complexity in the polynomial degree. A first category of methods relies on a direct interval evaluation of the given Chebyshev expansion in which the Chebyshev polynomials are bounded, e.g., with a divide-and-conquer strategy. Our main category of methods, based on the Clenshaw recurrence, includes interval Clenshaw with defect correction (ICDC) and a spectral transformation of the Clenshaw recurrence rewritten as a discrete dynamical system. An extension of the barycentric representation to interval arithmetic is also considered; it has log-linear complexity, as it takes advantage of a verified discrete cosine transform. We compare the different methods and provide illustrative numerical experiments. In particular, our eigenvalue-based methods are interesting for bounding the range of high-degree interval polynomials. Some of the methods rigorously compute narrow enclosures for high-degree Chebyshev expansions at thousands of points in a few seconds on an average computer. We also illustrate how to employ our methods as an automatic a posteriori forward error analysis tool to monitor the accuracy of the Chebfun feval command.
Citations: 2
High-performance Implementation of Elliptic Curve Cryptography Using Vector Instructions
Pub Date : 2019-07-30 DOI: 10.1145/3309759
Armando Faz-Hernández, Julio López, R. Dahab
Elliptic curve cryptosystems are considered an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used for instantiating Diffie-Hellman protocols named X25519 and X448. Mapping these curves to twisted Edwards curves allowed deriving two new signature instances, called Ed25519 and Ed448, of the Edwards Digital Signature Algorithm. In this work, we focus on the secure and efficient software implementation of these algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. Our contributions result in a high-performance software library for AVX2-ready processors. For example, our library computes digital signatures 19% (for Ed25519) and 29% (for Ed448) faster than previous optimized implementations. Also, our library improves by 10% and 20% the execution time of X25519 and X448, respectively.
Citations: 26
Algorithm 995
Pub Date : 2019-07-18 DOI: 10.1145/3301321
Juliette Pardue, Andrey N. Chernikov
A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator starting from the basic operations of vertex insertion and Delaunay triangles. Applications focusing on high-lift design or dynamic stall, as well as numerical methods and modeling test cases, still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by the geometric boundaries and the density functions: one-dimensional spacing functions generated from an exponential distribution. This approach yields computational savings in mesh generation and flow solution by using well-shaped anisotropic triangles instead of isotropic ones. The validity of the meshes is shown through comparisons of solution characteristics with verified reference solutions. Numerical experiments show a 79% parallel weak-scaling efficiency on 1,024 distributed-memory nodes and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed-memory nodes.
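The geometric test at the heart of Delaunay vertex insertion, deciding which triangles a new vertex invalidates, is the incircle predicate. A floating-point sketch is below; a production mesh generator would use robust (exact or adaptive-precision) arithmetic for this determinant, and this snippet is only an illustration of the predicate, not the paper's parallel algorithm.

```python
import numpy as np

def in_circumcircle(a, b, c, d):
    """Return True iff point d lies strictly inside the circumcircle of the
    counter-clockwise triangle (a, b, c): the sign of a 3x3 determinant of
    coordinates lifted onto the paraboloid z = x^2 + y^2."""
    m = np.array([
        [a[0] - d[0], a[1] - d[1], (a[0] - d[0])**2 + (a[1] - d[1])**2],
        [b[0] - d[0], b[1] - d[1], (b[0] - d[0])**2 + (b[1] - d[1])**2],
        [c[0] - d[0], c[1] - d[1], (c[0] - d[0])**2 + (c[1] - d[1])**2],
    ])
    return np.linalg.det(m) > 0
```

Vertex insertion collects every triangle whose circumcircle contains the new point (the cavity) and re-triangulates it, which is the basic operation the bottom-up generator is built from.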
Citations: 0
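The Algorithm 995 abstract mentions one-dimensional spacing functions derived from an exponential distribution for the anisotropic boundary layers, but gives no formulas. As a purely illustrative sketch (all function and parameter names here are hypothetical, not from the paper), a boundary-layer point distribution with geometrically growing cell heights can be generated like this:

```python
def boundary_layer_spacing(first_height, growth, n):
    """Return n node offsets from a wall, with cell heights growing
    geometrically (an exponential-type spacing commonly used for
    anisotropic boundary-layer meshes; illustrative only)."""
    offsets = [0.0]
    h = first_height
    for _ in range(n - 1):
        offsets.append(offsets[-1] + h)
        h *= growth  # each successive cell is `growth` times taller

    return offsets

# Example: 6 nodes, first cell 1e-3 thick, 20% growth per layer
spacing = boundary_layer_spacing(first_height=1e-3, growth=1.2, n=6)
```

Such a distribution concentrates points near the wall, which is what makes the resulting triangles anisotropic in the wall-normal direction.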
Algorithm 994
Pub Date : 2019-06-05 DOI: 10.1145/3302389
F. Hernando, Francisco D. Igual, G. Quintana-Ortí
The minimum distance of an error-correcting code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is crucial to many problems in this area. In this article, we present and assess a family of implementations of both the brute-force algorithm and the Brouwer-Zimmermann algorithm for computing the minimum distance of a random linear code over F2 that are faster than current implementations, both in the commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that produce high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable improvements compared with state-of-the-art implementations widely used nowadays.
{"title":"Algorithm 994","authors":"F. Hernando, Francisco D. Igual, G. Quintana-Ortí","doi":"10.1145/3302389","DOIUrl":"https://doi.org/10.1145/3302389","url":null,"abstract":"The minimum distance of an error-correcting code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is crucial to many problems in this area. In this article, we present and assess a family of implementations of both the brute-force algorithm and the Brouwer-Zimmermann algorithm for computing the minimum distance of a random linear code over F2 that are faster than current implementations, both in the commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that produce high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable improvements compared with state-of-the-art implementations widely used nowadays.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"21 1","pages":"1 - 28"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77977526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
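As a rough illustration of the brute-force baseline the Algorithm 994 abstract compares against (a toy sketch, not the paper's optimized sequential, parallel, or vectorized implementations), the minimum distance of a binary linear code can be found by enumerating every nonzero codeword and taking the smallest Hamming weight:

```python
from itertools import product

def min_distance_brute_force(generator_rows):
    """Minimum Hamming weight over all nonzero codewords of the binary
    linear code spanned by the generator-matrix rows (given as bitmask
    ints). Cost is O(2^k) codewords, which motivates faster methods
    such as Brouwer-Zimmermann."""
    k = len(generator_rows)
    best = None
    for coeffs in product((0, 1), repeat=k):
        if not any(coeffs):
            continue  # skip the zero codeword
        word = 0
        for c, row in zip(coeffs, generator_rows):
            if c:
                word ^= row  # addition over F2 is XOR
        weight = bin(word).count("1")
        if best is None or weight < best:
            best = weight
    return best

# [7,4] Hamming code generator matrix rows as 7-bit masks; its
# minimum distance is 3.
G = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
```

The exponential cost in the code dimension k is exactly why the article's Brouwer-Zimmermann implementations matter for random linear codes of nontrivial size.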
CGPOPS
Pub Date : 2019-05-28 DOI: 10.1145/3390463
Yunus M. Agamawi, Anil V. Rao
A general-purpose C++ software program called CGPOPS is described for solving multiple-phase optimal control problems using adaptive direct orthogonal collocation methods. The software employs a Legendre-Gauss-Radau direct orthogonal collocation method to transcribe the continuous optimal control problem into a large sparse nonlinear programming problem (NLP). A class of hp mesh refinement methods are implemented that determine the number of mesh intervals and the degree of the approximating polynomial within each mesh interval to achieve a specified accuracy tolerance. The software is interfaced with the open source Newton NLP solver IPOPT. All derivatives required by the NLP solver are computed via central finite differencing, bicomplex-step derivative approximations, hyper-dual derivative approximations, or automatic differentiation. The key components of the software are described in detail, and the utility of the software is demonstrated on five optimal control problems of varying complexity. The software described in this article provides researchers a transitional platform to solve a wide variety of complex constrained optimal control problems.
{"title":"CGPOPS","authors":"Yunus M. Agamawi, Anil V. Rao","doi":"10.1145/3390463","DOIUrl":"https://doi.org/10.1145/3390463","url":null,"abstract":"A general-purpose C++ software program called CGPOPS is described for solving multiple-phase optimal control problems using adaptive direct orthogonal collocation methods. The software employs a Legendre-Gauss-Radau direct orthogonal collocation method to transcribe the continuous optimal control problem into a large sparse nonlinear programming problem (NLP). A class of hp mesh refinement methods are implemented that determine the number of mesh intervals and the degree of the approximating polynomial within each mesh interval to achieve a specified accuracy tolerance. The software is interfaced with the open source Newton NLP solver IPOPT. All derivatives required by the NLP solver are computed via central finite differencing, bicomplex-step derivative approximations, hyper-dual derivative approximations, or automatic differentiation. The key components of the software are described in detail, and the utility of the software is demonstrated on five optimal control problems of varying complexity. The software described in this article provides researchers a transitional platform to solve a wide variety of complex constrained optimal control problems.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"57 1","pages":"1 - 38"},"PeriodicalIF":0.0,"publicationDate":"2019-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85956700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
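The CGPOPS abstract lists central finite differencing among the derivative options supplied to the NLP solver. A minimal, generic sketch of that idea (not CGPOPS's actual C++ implementation; the function names here are hypothetical) approximates each Jacobian column with a symmetric perturbation:

```python
def central_difference_jacobian(f, x, h=1e-6):
    """Approximate the Jacobian of f: R^n -> R^m at x using central
    differences, (f(x + h e_j) - f(x - h e_j)) / (2h) per column.
    The error is O(h^2), versus O(h) for one-sided differences."""
    fx = f(x)
    n, m = len(x), len(fx)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            jac[i][j] = (fp[i] - fm[i]) / (2 * h)
    return jac

def g(x):
    # example map g(x, y) = [x*y, x + y**2]
    return [x[0] * x[1], x[0] + x[1] ** 2]

J = central_difference_jacobian(g, [2.0, 3.0])
# analytic Jacobian at (2, 3): [[3, 2], [1, 6]]
```

Each Jacobian column costs two function evaluations, which is why the abstract's alternatives (bicomplex-step, hyper-dual, and automatic differentiation) are attractive when derivatives are needed to higher accuracy or order.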
Journal
ACM Transactions on Mathematical Software (TOMS)