2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH): Latest Papers

A CRC-Based Concurrent Fault Detection Architecture for Galois/Counter Mode (GCM)
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.19
Amir Ali Kouzeh Geran, A. Reyhani-Masoleh
The Galois/Counter Mode (GCM) is a recently adopted mode of operation for symmetric key cryptography to provide both data authenticity and confidentiality. To improve the reliability of hardware implementations of the GCM module, we propose a novel multiple-bit fault detection architecture for hardware implementation of the GCM module using cyclic redundancy check (CRC) codes. By changing the degree of the CRC generating polynomial, one can select the number of parity bits used in the fault detection scheme based on the available resources and required overheads. We derive new formulations for the corresponding fault-detection scheme for the entire GCM loop. Then, we provide FPGA implementation and fault coverage simulation results for different CRC generating polynomials. We show that using six parity bits, one can achieve high fault coverage of close to 100% with the critical path delay overhead of 23% and area overhead of 10.9% while the false alarm is 0.12%.
Citations: 3
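The GCM abstract above ties the number of parity bits to the degree of the CRC generating polynomial. A minimal sketch of that relationship follows; it is not the paper's hardware architecture. The generator x^6 + x + 1 and the function names are assumptions chosen only to show a six-parity-bit check.

```python
# Illustrative generator of degree 6 (x^6 + x + 1); an assumption, not taken from the paper.
POLY6 = 0b1000011


def crc_remainder(data: int, nbits: int, poly: int) -> int:
    """Remainder of data(x) * x^deg modulo poly(x) over GF(2)."""
    deg = poly.bit_length() - 1          # number of parity bits
    reg = data << deg                    # make room for the parity bits
    for i in range(nbits + deg - 1, deg - 1, -1):
        if (reg >> i) & 1:               # cancel the leading term
            reg ^= poly << (i - deg)
    return reg                           # fits in `deg` bits


def check_word(word: int, nbits: int, parity: int, poly: int = POLY6) -> bool:
    """Recompute the parity and compare it with the stored one."""
    return crc_remainder(word, nbits, poly) == parity
```

A generator of higher degree yields more parity bits (and more checking logic), which is the trade-off the abstract describes.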
Hybrid Position-Residues Number System
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.15
Karim Bigou, A. Tisserand
We propose a hybrid representation of large integers, or prime field elements, combining both positional and residue number systems (RNS). Our hybrid position-residues (HPR) number system mixes a high-radix positional representation and digits represented in RNS. RNS offers an important source of parallelism for addition, subtraction and multiplication operations. But, due to its non-positional property, it makes comparisons and modular reductions more costly than in a positional number system. HPR offers various trade-offs between internal parallelism and the efficiency of operations requiring position information. Our current application domain is asymmetric cryptography where HPR significantly reduces the cost of some modular operations compared to state-of-the-art RNS solutions.
Citations: 12
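The HPR abstract above relies on the channel-wise parallelism of RNS. The sketch below shows plain RNS only (not the HPR system): addition and multiplication act on each residue independently, while recovering a positional value needs the Chinese Remainder Theorem. The moduli and function names are illustrative assumptions.

```python
from math import prod

# Pairwise coprime moduli; an illustrative choice, not taken from the paper.
MODULI = (13, 15, 16, 17)


def to_rns(x, moduli=MODULI):
    """Residues of x, one per channel."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition: no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication, equally independent per channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(r, moduli=MODULI):
    """CRT reconstruction (needs Python 3.8+ for pow with exponent -1)."""
    M = prod(moduli)
    total = 0
    for ri, mi in zip(r, moduli):
        Mi = M // mi
        total += ri * Mi * pow(Mi, -1, mi)
    return total % M
```

Comparing two RNS values has no such channel-wise shortcut, which is the cost the abstract refers to.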
A New Multiplication Algorithm for Extended Precision Using Floating-Point Expansions
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.18
J. Muller, Valentina Popescu, P. T. P. Tang
Some important computational problems must use a floating-point (FP) precision several times higher than the one available in hardware. These computations critically rely on software libraries for high-precision FP arithmetic. The representation of a high-precision data type crucially influences the corresponding arithmetic algorithms. Recent work showed that algorithms for FP expansions, that is, a representation based on an unevaluated sum of standard FP types, benefit from various high-performance support for native FP, such as low latency, high throughput, vectorization, threading, etc. Bailey's QD library and its corresponding Graphics Processing Unit (GPU) version, GQD, are such examples. Despite using native FP arithmetic as the key operations, QD and GQD algorithms are focused on double-double or quad-double representations and do not generalize efficiently or naturally to a flexible number of components in the FP expansion. In this paper, we introduce a new multiplication algorithm for FP expansions with flexible precision, with up to the order of tens of FP elements in mind. The main feature consists in the partial products being accumulated in a specially designed data structure that has the regularity of a fixed-point representation while allowing the computation to be naturally carried out using native FP types. This allows us to easily avoid unnecessary computation and to present rigorous accuracy analysis transparently. The algorithm, its correctness and accuracy proofs and some performance comparisons with existing libraries are all contributions of this paper.
Citations: 7
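FP expansions, as described in the abstract above, store a value as an unevaluated sum of native floats. The classical building blocks for such arithmetic are error-free transformations; the sketch below shows Knuth's TwoSum and Dekker's product (with Veltkamp splitting) for IEEE double precision. These are standard textbook routines used here for illustration, not the paper's new multiplication algorithm, and they assume round-to-nearest with no overflow or underflow.

```python
from fractions import Fraction


def two_sum(a: float, b: float):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a: float):
    """Veltkamp split of a double into two non-overlapping halves."""
    c = 134217729.0 * a          # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a: float, b: float):
    """Return (p, e) with p = fl(a * b) and a * b = p + e exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e


def exact(*parts: float) -> Fraction:
    """Exact rational value of an unevaluated sum of floats."""
    return sum((Fraction(x) for x in parts), Fraction(0))
```

The pair `(p, e)` returned by `two_prod` is itself a two-term expansion of the product, which is how double-double style libraries represent extended-precision results.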
Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.21
Niall Emmart, J. Luitjens, C. Weems, Cliff Woolley
In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU's instructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior state-of-the-art row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peak rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.
Citations: 10
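The two-pass idea in the Maxwell abstract above can be illustrated on a single 32 x 32-bit product built from 16-bit halves. This is a simplified single-word sketch under the assumption that "unaligned" means the cross products (which sit 16 bits off a 32-bit word boundary) and "aligned" means the low-by-low and high-by-high products; it is not the XMP kernel, and the function name is hypothetical.

```python
MASK16 = 0xFFFF


def mul32_two_pass(a: int, b: int) -> int:
    """32x32 -> 64-bit product using only 16x16-bit multiplications."""
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16

    # Pass 1: unaligned (cross) partial products, accumulated unshifted.
    acc = a_lo * b_hi + a_hi * b_lo

    # One shift by 16 bits moves the whole accumulated sum into place.
    acc <<= 16

    # Pass 2: aligned partial products land on 32-bit word boundaries.
    acc += a_lo * b_lo
    acc += (a_hi * b_hi) << 32
    return acc
```

Accumulating all cross products before shifting means the shift is applied once per accumulator instead of once per partial product.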
Random Digit Representation of Integers
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.11
N. Méloni, M. A. Hasan
Modular exponentiation, or scalar multiplication, is core to today's mainstream public key cryptographic systems. In this article we generalize the classical fractional wNAF method for modular exponentiation: the classical method uses a digit set of the form {1, 3, ..., m}, which is extended here to any set of odd integers of the form {1, d_2, ..., d_n}. We propose a general modular exponentiation algorithm based on a generalization of the frac-wNAF recoding and a new precomputation scheme. We also give a general formula for the average density of non-zero terms in these representations, prove that there are infinitely many optimal sets for a given number of digits and show that the asymptotic behavior, when those digits are randomly chosen, is very close to the optimal case.
Citations: 6
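For context on the recoding that the abstract above generalizes, here is a minimal sketch of the classical width-w NAF, whose digit set is {±1, ±3, ..., ±(2^(w-1) - 1)}. It is the textbook baseline only; the fractional variant and the paper's arbitrary digit sets {1, d_2, ..., d_n} are not implemented here.

```python
def wnaf(k: int, w: int) -> list:
    """Width-w NAF digits of a positive integer k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)             # look at the low w bits
            if d >= 1 << (w - 1):
                d -= 1 << w              # pick the signed representative
            k -= d                       # now k is divisible by 2**w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def density(digits: list) -> float:
    """Fraction of non-zero digits in a recoding."""
    return sum(1 for d in digits if d) / len(digits)
```

Every non-zero digit is followed by at least w - 1 zeros, so fewer multiplications (or point additions) are needed than with the plain binary expansion.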
Hardware Implementation of AES Using Area-Optimal Polynomials for Composite-Field Representation GF(2^4)^2 of GF(2^8)
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.32
S. Gueron, S. Mathew
This paper discusses the question of optimizing AES hardware designs by using the composite field representation GF(2^4)^2 of the field GF(2^8) that underlies the definition of AES. Here, GF(2^4)^2 is the field extension of the ground field GF(2^4) with an extension polynomial of the form x^2 + αx + β, where α and β are elements of the field GF(2^4). Previous designs with such representations used α = 1, which seemingly leads to some obvious savings. By contrast, we seek the optimal designs among all the possibilities. Our designs are based on mapping the input, output, round keys, and the AES operations to and from any one of the 2880 possible representations of GF(2^8) as GF(2^4)^2. For each representation, we also explore three options for the affine/invaffine constants, resulting in a total of 8640 possible designs. We identify the smallest area representations for AES encryption-only, decryption-only, and for unified encryption-decryption. Surprisingly, the optimal representations in each case are different from each other. In addition, we identify six distinct representations that are optimal, based on operating-mode and AES pipeline depth. Among other results, we show here a set of high-bandwidth 16-byte AES datapaths with extension polynomials of the form x^2 + αx + β where α ≠ 1, showing that the a-priori obvious choice of using α = 1 does not necessarily lead to the best result. We provide the full details of all the design possibilities, together with their respective areas, based on a 22nm CMOS implementation.
Citations: 10
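The abstract above does not spell out how the figure of 2880 representations arises. One accounting that reproduces it, offered here as an assumption rather than as the authors' derivation, is: 3 irreducible ground-field polynomials of degree 4 over GF(2), times 120 irreducible extension polynomials x^2 + αx + β over GF(2^4), times 8 isomorphisms (one per root of the AES polynomial in the composite field). The brute-force sketch below checks the first two counts; the factor 8 is taken as given.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product of two polynomials given as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def irreducible_quartics() -> list:
    """Degree-4 polynomials over GF(2) that are not a product of smaller ones."""
    products = {clmul(a, b) for a in range(2, 16) for b in range(2, 16)}
    return [p for p in range(16, 32) if p not in products]


def gf16_mul(a: int, b: int, mod: int) -> int:
    """Multiply in GF(2^4) defined by the degree-4 polynomial `mod`."""
    r = clmul(a, b)
    for i in range(6, 3, -1):            # reduce degrees 6, 5, 4
        if (r >> i) & 1:
            r ^= mod << (i - 4)
    return r


def count_irreducible_quadratics(mod: int) -> int:
    """Count (alpha, beta) such that x^2 + alpha*x + beta has no root in GF(2^4)."""
    count = 0
    for alpha in range(16):
        for beta in range(16):
            if all(gf16_mul(t, t, mod) ^ gf16_mul(alpha, t, mod) ^ beta
                   for t in range(16)):
                count += 1
    return count


def representation_count(isomorphisms_per_field: int = 8) -> int:
    """Total under the assumed accounting described in the lead-in."""
    return sum(count_irreducible_quadratics(m) * isomorphisms_per_field
               for m in irreducible_quartics())
```

With three affine/invaffine options per representation, the same accounting gives 3 × 2880 = 8640, matching the design count in the abstract.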
Efficient Combinational Circuits for Division by Small Integer Constants
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.23
H. F. Ugurdag, A. Bayram, Vecdi Emre Levent, Sezer Gören
Division of an integer by an integer constant is a widely used operation and hence justifies a customized efficient implementation. There are various versions of this operation. This paper attacks a particular version of this problem, where the divisor is small and the circuit outputs a quotient and remainder. We propose a fast (low-latency) yet area-efficient combinational circuit topology, which we call Binary Tree based Constant Division (BTCD). BTCD uses a collection of small LUTs wired to each other to form a binary tree. The circuit also has a bunch of adders, whose latencies are almost hidden as they operate in parallel with the binary tree. We wrote RTL code generators for BTCD and two previous works in the literature, then generated circuits for dividends of up to 128 bits and divisors of 3, 5, 11, and 23. We synthesized the generated RTL designs using a commercial ASIC synthesis tool. BTCD strikes a good balance between timing (latency) and area. It is up to 3.3 times better in Area-Timing Product (ATP) compared to the best alternative. ATP has a good correlation with energy consumption.
Citations: 2
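To give a feel for why a binary tree suits constant division, the sketch below computes only the remainder of a wide dividend by combining chunk remainders pairwise, using rem(hi·2^k + lo) = (rem(hi)·(2^k mod d) + rem(lo)) mod d. Each combine step has a tiny input domain (two values below d), so in hardware it could be a small lookup table. This is a software illustration of the general idea, not the BTCD circuit; the quotient logic and adders are omitted, and the names and leaf width are assumptions.

```python
def tree_remainder(x: int, nbits: int, d: int, leaf_bits: int = 4) -> int:
    """Remainder of an nbits-wide x modulo a small constant d, computed as a tree."""
    mask = (1 << leaf_bits) - 1
    # Leaves: remainders of leaf_bits-wide chunks, least significant first.
    nodes = [((x >> i) & mask) % d for i in range(0, nbits, leaf_bits)]
    width = leaf_bits                    # bits covered by each node at this level
    while len(nodes) > 1:
        scale = pow(2, width, d)         # constant per level: 2**width mod d
        merged = []
        for j in range(0, len(nodes), 2):
            lo = nodes[j]
            hi = nodes[j + 1] if j + 1 < len(nodes) else 0
            merged.append((hi * scale + lo) % d)   # small LUT-sized function
        nodes = merged
        width *= 2
    return nodes[0]
```

The tree depth grows with the logarithm of the dividend width, whereas a chunk-serial chain of lookups grows linearly.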
Recovering Numerical Reproducibility in Hydrodynamic Simulations
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.27
P. Langlois, R. Nheili, C. Denis
HPC simulations suffer from failures of numerical reproducibility because of floating-point arithmetic peculiarities. Different computing distributions of a parallel computation may yield different numerical results. We are interested in a finite element computation of hydrodynamic simulations within the openTelemac software where parallelism is provided by domain decomposition. One main task in a finite element simulation consists of building one large linear system and solving it. Here the building step relies on element-by-element storage mode and the solving step applies the conjugate gradient algorithm. The subdomain parallelism is merged within these steps. We study why reproducibility fails in this process and which operations have to be corrected. We detail how to use compensation techniques to compute a numerically reproducible resolution. We illustrate this approach with the reproducible version of one test case provided by the openTelemac software suite.
Citations: 5
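The reproducibility failure described in the abstract above comes from floating-point addition not being associative, so a different accumulation order (for example, a different domain decomposition) can change the result. The sketch below contrasts a naive accumulation with a compensated one built on TwoSum. It is a generic illustration of compensation, not the openTelemac code, and compensation makes identical results much more likely rather than guaranteed for arbitrary data.

```python
def naive_sum(values):
    """Plain left-to-right accumulation (deliberately not the built-in sum)."""
    s = 0.0
    for v in values:
        s += v
    return s


def two_sum(a: float, b: float):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)


def compensated_sum(values):
    """Accumulate rounding errors separately and add them back at the end."""
    s = 0.0
    c = 0.0
    for v in values:
        s, e = two_sum(s, v)
        c += e
    return s + c


ORDER_A = [1e16, 1.0, -1e16, 1.0]   # same numbers ...
ORDER_B = [1.0, 1.0, 1e16, -1e16]   # ... accumulated in another order
```

On this data the naive sums differ between the two orders (1.0 versus 2.0), while the compensated sum returns 2.0 for both.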
Accelerating Big Integer Arithmetic Using Intel IFMA Extensions
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.22
S. Gueron, V. Krasnov
Intel has recently announced a new set of processor instructions, dubbed AVX512IFMA, that carry out Integer Fused Multiply Accumulate operations. These instructions operate on 512-bit registers and compute eight independent 52-bit unsigned integer multiplications, to generate eight 104-bit products, and accumulate their low/high halves into 64-bit containers. Using these instructions requires that inputs be converted to (redundant form) radix 2^52, and outputs be converted to the desired representation. This paper demonstrates several techniques for leveraging the AVX512IFMA instructions in order to speed up big-integer multiplications. Although processors that support AVX512IFMA are not yet available at the time this paper is written, we show how currently available public tools can be used for estimating their potential performance benefits. For example, based on these tools, we expect a 2x speedup for 1024-bit integer multiplication, over the best currently available method.
Citations: 14
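The IFMA abstract above describes the lane semantics (52-bit by 52-bit unsigned multiply, low or high half of the 104-bit product added into a 64-bit container) and the need for a radix-2^52 representation. The following is a minimal Python sketch of that arithmetic only, assuming plain schoolbook multiplication; it is not the authors' optimized technique and says nothing about performance. The function names `madd52lo`, `madd52hi`, `to_radix52`, `from_radix52`, and `mul_radix52` are hypothetical helpers introduced here; they model a single lane of the VPMADD52LUQ and VPMADD52HUQ instructions rather than a full 512-bit register.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """One 64-bit lane: add the low 52 bits of a[51:0] * b[51:0] to acc (mod 2^64)."""
    prod = (a & MASK52) * (b & MASK52)  # up to 104 bits
    return (acc + (prod & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """One 64-bit lane: add the high 52 bits of a[51:0] * b[51:0] to acc (mod 2^64)."""
    prod = (a & MASK52) * (b & MASK52)
    return (acc + (prod >> 52)) & MASK64


def to_radix52(n, limbs):
    """Split a non-negative integer into `limbs` digits of 52 bits, least significant first."""
    return [(n >> (52 * i)) & MASK52 for i in range(limbs)]


def from_radix52(digits):
    """Recombine digits; digits may exceed 52 bits (redundant form) and are carried here."""
    return sum(d << (52 * i) for i, d in enumerate(digits))


def mul_radix52(x, y, bits=1024):
    """Schoolbook product of two `bits`-bit integers using only the two lane operations."""
    limbs = -(-bits // 52)  # ceiling division: 20 limbs for 1024 bits
    a, b = to_radix52(x, limbs), to_radix52(y, limbs)
    acc = [0] * (2 * limbs)
    for i in range(limbs):
        for j in range(limbs):
            # The low half lands in column i+j, the high half in column i+j+1.
            acc[i + j] = madd52lo(acc[i + j], a[i], b[j])
            acc[i + j + 1] = madd52hi(acc[i + j + 1], a[i], b[j])
    return from_radix52(acc)


if __name__ == "__main__":
    x = (1 << 1024) - 12345
    y = (1 << 1023) + 6789
    assert mul_radix52(x, y) == x * y
```

Each 64-bit container holds sums of 52-bit terms, so it has 12 spare bits. For 20 limbs, a column receives at most 40 such terms, which stays far below 2^64; this is why carries can be deferred until the final conversion out of the redundant form.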
A Parallel Decimal Multiplier Using Hybrid Binary Coded Decimal (BCD) Codes
Pub Date : 2016-07-10 DOI: 10.1109/ARITH.2016.8
Xiaoping Cui, Weiqiang Liu, Dong Wenwen, F. Lombardi
This paper proposes a parallel decimal multiplier that improves performance mainly by exploiting the properties of three different binary coded decimal (BCD) codes: the redundant BCD excess-3 (XS-3) code, the overloaded decimal digit set (ODDS) code, and the BCD-4221/5211 code; the design is therefore referred to as hybrid. Partial product (PP) generation uses signed-digit radix-10 recoding with the digit set {-5, ..., 5} and the redundant BCD excess-3 (XS-3) representation. A new decimal partial product reduction (PPR) tree is also proposed; it consists of a binary PPR tree block, a non-fixed-size BCD-4221 counter correction block, and a BCD-4221/5211 decimal PPR tree block. Analysis and comparison using the logical effort model and 45 nm technology show that the proposed decimal multiplier is faster than previous designs in the technical literature.
Citations: 9
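The decimal multiplier abstract above names the BCD-4221 and BCD-5211 codes without spelling them out. As a rough illustration, the sketch below treats them as 4-bit weighted codes with bit weights (4, 2, 2, 1) and (5, 2, 1, 1), and checks one commonly cited property of the pair: a number whose digits are coded in 5211, shifted left by one bit and re-read as 4221, equals twice the original. This is only a sketch of the codes themselves, not the paper's PPR tree or its correction block, and the helper names `decode_digit`, `encode_digit`, and `double_via_5211` are hypothetical.

```python
WEIGHTS = {"8421": (8, 4, 2, 1), "4221": (4, 2, 2, 1), "5211": (5, 2, 1, 1)}


def decode_digit(bits, code):
    """Value of a 4-bit tuple (most significant bit first) under a weighted code."""
    return sum(w * b for w, b in zip(WEIGHTS[code], bits))


def encode_digit(d, code):
    """Return one valid 4-bit pattern for digit d; 4221 and 5211 patterns are not unique."""
    for n in range(16):
        bits = tuple((n >> s) & 1 for s in (3, 2, 1, 0))
        if decode_digit(bits, code) == d:
            return bits
    raise ValueError(f"digit {d} has no pattern in code {code}")


def double_via_5211(n):
    """Double a non-negative integer by a 1-bit left shift of its 5211-coded digits."""
    digits = [int(ch) for ch in str(n)]  # most significant digit first
    bits = [b for d in digits for b in encode_digit(d, "5211")]
    # Shift left by one bit: append a 0 on the right, pad on the left to a full digit.
    shifted = [0, 0, 0] + bits + [0]
    result = 0
    for i in range(0, len(shifted), 4):
        # Each regrouped 4-bit slice is read with 4221 weights, giving a digit in 0..9.
        result = result * 10 + decode_digit(tuple(shifted[i:i + 4]), "4221")
    return result


if __name__ == "__main__":
    assert all(double_via_5211(n) == 2 * n for n in range(10000))
```

The shift works because the weight-5 bit of each digit moves into the weight-1 position of the next higher digit (5 x 2 = 10), while the weights 2, 1, 1 become 4, 2, 2 in place.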
Venue
2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH)