The Galois/Counter Mode (GCM) is a recently adopted mode of operation for symmetric key cryptography to provide both data authenticity and confidentiality. To improve the reliability of hardware implementations of the GCM module, we propose a novel multiple-bit fault detection architecture for hardware implementation of the GCM module using cyclic redundancy check (CRC) codes. By changing the degree of the CRC generating polynomial, one can select the number of parity bits used in the fault detection scheme based on the available resources and required overheads. We derive new formulations for the corresponding fault-detection scheme for the entire GCM loop. Then, we provide FPGA implementation and fault coverage simulation results for different CRC generating polynomials. We show that using six parity bits, one can achieve high fault coverage of close to 100% with the critical path delay overhead of 23% and area overhead of 10.9% while the false alarm is 0.12%.
{"title":"A CRC-Based Concurrent Fault Detection Architecture for Galois/Counter Mode (GCM)","authors":"Amir Ali Kouzeh Geran, A. Reyhani-Masoleh","doi":"10.1109/ARITH.2016.19","DOIUrl":"https://doi.org/10.1109/ARITH.2016.19","url":null,"abstract":"The Galois/Counter Mode (GCM) is a recently adopted mode of operation for symmetric key cryptography to provide both data authenticity and confidentiality. To improve the reliability of hardware implementations of the GCM module, we propose a novel multiple-bit fault detection architecture for hardware implementation of the GCM module using cyclic redundancy check (CRC) codes. By changing the degree of the CRC generating polynomial, one can select the number of parity bits used in the fault detection scheme based on the available resources and required overheads. We derive new formulations for the corresponding fault-detection scheme for the entire GCM loop. Then, we provide FPGA implementation and fault coverage simulation results for different CRC generating polynomials. We show that using six parity bits, one can achieve high fault coverage of close to 100% with the critical path delay overhead of 23% and area overhead of 10.9% while the false alarm is 0.12%.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116770723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose an hybrid representation of large integers, or prime field elements, combining both positional and residue number systems (RNS). Our hybrid position-residues (HPR) number system mixes a high-radix positional representation and digits represented in RNS. RNS offers an important source of parallelism for addition, subtraction and multiplication operations. But, due to its non-positional property, it makes comparisons and modular reductions more costly than in a positional number system. HPR offers various trade-offs between internal parallelism and the efficiency of operations requiring position information. Our current application domain is asymmetric cryptography where HPR significantly reduces the cost of some modular operations compared to state-of-the-art RNS solutions.
{"title":"Hybrid Position-Residues Number System","authors":"Karim Bigou, A. Tisserand","doi":"10.1109/ARITH.2016.15","DOIUrl":"https://doi.org/10.1109/ARITH.2016.15","url":null,"abstract":"We propose an hybrid representation of large integers, or prime field elements, combining both positional and residue number systems (RNS). Our hybrid position-residues (HPR) number system mixes a high-radix positional representation and digits represented in RNS. RNS offers an important source of parallelism for addition, subtraction and multiplication operations. But, due to its non-positional property, it makes comparisons and modular reductions more costly than in a positional number system. HPR offers various trade-offs between internal parallelism and the efficiency of operations requiring position information. Our current application domain is asymmetric cryptography where HPR significantly reduces the cost of some modular operations compared to state-of-the-art RNS solutions.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129290858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Some important computational problems must use a floating-point (FP) precision several times higher than the hardware-implemented available one. These computations critically rely on software libraries for high-precision FP arithmetic. The representation of a high-precision data type crucially influences the corresponding arithmetic algorithms. Recent work showed that algorithms for FP expansions, that is, a representation based on unevaluated sum of standard FP types, benefit from various high-performance support for native FP, such as low latency, high throughput, vectorization, threading, etc. Bailey's QD library and its corresponding Graphics Processing Unit (GPU) version, GQD, are such examples. Despite using native FP arithmetic as the key operations, QD and GQD algorithms are focused on double-double or quad-double representations and do not generalize efficiently or naturally to a flexible number of components in the FP expansion. In this paper, we introduce a new multiplication algorithm for FP expansion with flexible precision, up to the order of tens of FP elements in mind. The main feature consists in the partial products being accumulated in a special designed data structure that has the regularity of a fixed-point representation while allowing the computation to be naturally carried out using native FP types. This allows us to easily avoid unnecessary computation and to present rigorous accuracy analysis transparently. The algorithm, its correctness and accuracy proofs and some performance comparisons with existing libraries are all contributions of this paper.
{"title":"A New Multiplication Algorithm for Extended Precision Using Floating-Point Expansions","authors":"J. Muller, Valentina Popescu, P. T. P. Tang","doi":"10.1109/ARITH.2016.18","DOIUrl":"https://doi.org/10.1109/ARITH.2016.18","url":null,"abstract":"Some important computational problems must use a floating-point (FP) precision several times higher than the hardware-implemented available one. These computations critically rely on software libraries for high-precision FP arithmetic. The representation of a high-precision data type crucially influences the corresponding arithmetic algorithms. Recent work showed that algorithms for FP expansions, that is, a representation based on unevaluated sum of standard FP types, benefit from various high-performance support for native FP, such as low latency, high throughput, vectorization, threading, etc. Bailey's QD library and its corresponding Graphics Processing Unit (GPU) version, GQD, are such examples. Despite using native FP arithmetic as the key operations, QD and GQD algorithms are focused on double-double or quad-double representations and do not generalize efficiently or naturally to a flexible number of components in the FP expansion. In this paper, we introduce a new multiplication algorithm for FP expansion with flexible precision, up to the order of tens of FP elements in mind. The main feature consists in the partial products being accumulated in a special designed data structure that has the regularity of a fixed-point representation while allowing the computation to be naturally carried out using native FP types. This allows us to easily avoid unnecessary computation and to present rigorous accuracy analysis transparently. The algorithm, its correctness and accuracy proofs and some performance comparisons with existing libraries are all contributions of this paper.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133919053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Niall Emmart, J. Luitjens, C. Weems, Cliff Woolley
In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.
{"title":"Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs","authors":"Niall Emmart, J. Luitjens, C. Weems, Cliff Woolley","doi":"10.1109/ARITH.2016.21","DOIUrl":"https://doi.org/10.1109/ARITH.2016.21","url":null,"abstract":"In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129068939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modular exponentiation, or scalar multiplication, is core to today's main stream public key cryptographic systems. In this article we generalize the classical fractional wNAF method for modular exponentiation - the classical method uses a digit set of the form {1, 3, . . . , m} which is extended here to any set of odd integers of the form {1, d2, . . . , dn}. We propose a general modular exponentiation algorithm based on a generalization of the frac-wNAF recoding and a new precomputation scheme. We also give general formula for the average density of non-zero therms in these representations, prove that there are infinitely many optimal sets for a given number of digits and show that the asymptotic behavior, when those digits are randomly chosen, is very close to the optimal case.
{"title":"Random Digit Representation of Integers","authors":"N. Méloni, M. A. Hasan","doi":"10.1109/ARITH.2016.11","DOIUrl":"https://doi.org/10.1109/ARITH.2016.11","url":null,"abstract":"Modular exponentiation, or scalar multiplication, is core to today's main stream public key cryptographic systems. In this article we generalize the classical fractional wNAF method for modular exponentiation - the classical method uses a digit set of the form {1, 3, . . . , m} which is extended here to any set of odd integers of the form {1, d2, . . . , dn}. We propose a general modular exponentiation algorithm based on a generalization of the frac-wNAF recoding and a new precomputation scheme. We also give general formula for the average density of non-zero therms in these representations, prove that there are infinitely many optimal sets for a given number of digits and show that the asymptotic behavior, when those digits are randomly chosen, is very close to the optimal case.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129022501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper discusses the question of optimizing AES hardware designs, by using the composite field representation GF(24)2 of the field GF(28), that underlies the definition of AES. Here, GF(24)2 is the field extension of the ground field GF(24) with an extension polynomial of the form x2 + αx + β, where a and β are elements of field GF(24). Previous designs with such representations used α = 1, which seemingly leads to some obvious savings. By contrast, we seek the optimal designs among all the possibilities. Our designs are based on mapping the input, output, round keys, and the AES operations to and from any one of the 2880 possible representations of GF(28) as (24)2. For each representation, we also explore three options for the affine/invaffine constants, resulting in a total of 8640 possible designs. We identify the smallest area representations for AES encryption-only, decryption-only, and for unified encryptiondecryption. Surprisingly, the optimal representations in each case are different from each other. In addition, we identify six distinct representations that are optimal, based on operating-mode and AES pipeline depth. Among other results, we show here a set of high-bandwidth 16-byte AES datapaths with the extension polynomials of the form x2 + αx + β where α ≠ 1, showing that the a-priori obvious choice of using α = 1, does not necessarily lead to the best result. We provide the full details of all the designs possibilities, together with their respective area, based on 22nm CMOS implementation.
{"title":"Hardware Implementation of AES Using Area-Optimal Polynomials for Composite-Field Representation GF(2^4)^2 of GF(2^8)","authors":"S. Gueron, S. Mathew","doi":"10.1109/ARITH.2016.32","DOIUrl":"https://doi.org/10.1109/ARITH.2016.32","url":null,"abstract":"This paper discusses the question of optimizing AES hardware designs, by using the composite field representation GF(2<sup>4</sup>)<sup>2</sup> of the field GF(2<sup>8</sup>), that underlies the definition of AES. Here, GF(2<sup>4</sup>)<sup>2</sup> is the field extension of the ground field GF(2<sup>4</sup>) with an extension polynomial of the form x2 + αx + β, where a and β are elements of field GF(2<sup>4</sup>). Previous designs with such representations used α = 1, which seemingly leads to some obvious savings. By contrast, we seek the optimal designs among all the possibilities. Our designs are based on mapping the input, output, round keys, and the AES operations to and from any one of the 2880 possible representations of GF(2<sup>8</sup>) as (2<sup>4</sup>)<sup>2</sup>. For each representation, we also explore three options for the affine/invaffine constants, resulting in a total of 8640 possible designs. We identify the smallest area representations for AES encryption-only, decryption-only, and for unified encryptiondecryption. Surprisingly, the optimal representations in each case are different from each other. In addition, we identify six distinct representations that are optimal, based on operating-mode and AES pipeline depth. Among other results, we show here a set of high-bandwidth 16-byte AES datapaths with the extension polynomials of the form x<sup>2</sup> + αx + β where α ≠ 1, showing that the a-priori obvious choice of using α = 1, does not necessarily lead to the best result. We provide the full details of all the designs possibilities, together with their respective area, based on 22nm CMOS implementation.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133971509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. F. Ugurdag, A. Bayram, Vecdi Emre Levent, Sezer Gören
Division of an integer by an integer constant is a widely used operation and hence justifies a customized efficient implementation. There are various versions of this operation. This paper attacks a particular version of this problem, where the divisor is small and the circuit outputs a quotient and remainder. We propose a fast (low-latency) yet area-efficient combinational circuit topology, which we call Binary Tree based Constant Division (BTCD). BTCD uses a collection of small LUTs wired to each other to form a binary tree. The circuit also has bunch of adders, whose latencies are almost hidden as they operate in parallel with the binary tree. We wrote RTL code generators for BTCD and two previous works in the literature, then generated circuits for dividends of up to 128 bits and divisors of 3, 5, 11, and 23. We synthesized the generated RTL designs using a commercial ASIC synthesis tool. BTCD strikes a good balance between timing (latency) and area. It is up to 3.3 times better in Area-Timing Product (ATP) compared to the best alternative. ATP has a good correlation with energy consumption.
{"title":"Efficient Combinational Circuits for Division by Small Integer Constants","authors":"H. F. Ugurdag, A. Bayram, Vecdi Emre Levent, Sezer Gören","doi":"10.1109/ARITH.2016.23","DOIUrl":"https://doi.org/10.1109/ARITH.2016.23","url":null,"abstract":"Division of an integer by an integer constant is a widely used operation and hence justifies a customized efficient implementation. There are various versions of this operation. This paper attacks a particular version of this problem, where the divisor is small and the circuit outputs a quotient and remainder. We propose a fast (low-latency) yet area-efficient combinational circuit topology, which we call Binary Tree based Constant Division (BTCD). BTCD uses a collection of small LUTs wired to each other to form a binary tree. The circuit also has bunch of adders, whose latencies are almost hidden as they operate in parallel with the binary tree. We wrote RTL code generators for BTCD and two previous works in the literature, then generated circuits for dividends of up to 128 bits and divisors of 3, 5, 11, and 23. We synthesized the generated RTL designs using a commercial ASIC synthesis tool. BTCD strikes a good balance between timing (latency) and area. It is up to 3.3 times better in Area-Timing Product (ATP) compared to the best alternative. ATP has a good correlation with energy consumption.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132365748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HPC simulations suffer from failures of numerical reproducibility because of floating-point arithmetic peculiarities. Different computing distributions of a parallel computation may yield different numerical results. We are interested in a finite element computation of hydrodynamic simulations within the openTelemac software where parallelism is provided by domain decomposition. One main task in a finite element simulation consists in building one large linear system and to solve it. Here the building step relies on element-by-element storage mode and the solving step applies the conjugated gradient algorithm. The subdomain parallelism is merged within these steps. We study why reproducibility fails in this process and which operations have to be corrected. We detail how to use compensation techniques to compute a numerically reproducible resolution. We illustrate this approach with the reproducible version of one test case provided by the openTelemac software suite.
{"title":"Recovering Numerical Reproducibility in Hydrodynamic Simulations","authors":"P. Langlois, R. Nheili, C. Denis","doi":"10.1109/ARITH.2016.27","DOIUrl":"https://doi.org/10.1109/ARITH.2016.27","url":null,"abstract":"HPC simulations suffer from failures of numerical reproducibility because of floating-point arithmetic peculiarities. Different computing distributions of a parallel computation may yield different numerical results. We are interested in a finite element computation of hydrodynamic simulations within the openTelemac software where parallelism is provided by domain decomposition. One main task in a finite element simulation consists in building one large linear system and to solve it. Here the building step relies on element-by-element storage mode and the solving step applies the conjugated gradient algorithm. The subdomain parallelism is merged within these steps. We study why reproducibility fails in this process and which operations have to be corrected. We detail how to use compensation techniques to compute a numerically reproducible resolution. We illustrate this approach with the reproducible version of one test case provided by the openTelemac software suite.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114208571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intel has recently announced a new set of processor instructions, dubbed AVX512IFMA, that carry out Integer Fused Multiply Accumulate operations. These instructions operate on 512-bit registers and compute eight independent 52-bit unsigned integer multiplications, to generate eight 104-bit products, and accumulate their low/high halves into 64-bit containers. Using these instructions requires that inputs are converted to (redundant form) radix 252, and outputs are converted to the desired representation. This paper demonstrates several techniques for leveraging the AVX512IFMA instructions in order to speed up big-integer multiplications. Although processors that support AVX512IFMA are not yet available at the time this paper is written, we show how currently available public tools can be used for estimating their potential performance benefits. For example, based on these tools, we expect a 2x speedup for 1024-bit integer multiplication, over the best currently available method.
{"title":"Accelerating Big Integer Arithmetic Using Intel IFMA Extensions","authors":"S. Gueron, V. Krasnov","doi":"10.1109/ARITH.2016.22","DOIUrl":"https://doi.org/10.1109/ARITH.2016.22","url":null,"abstract":"Intel has recently announced a new set of processor instructions, dubbed AVX512IFMA, that carry out Integer Fused Multiply Accumulate operations. These instructions operate on 512-bit registers and compute eight independent 52-bit unsigned integer multiplications, to generate eight 104-bit products, and accumulate their low/high halves into 64-bit containers. Using these instructions requires that inputs are converted to (redundant form) radix 252, and outputs are converted to the desired representation. This paper demonstrates several techniques for leveraging the AVX512IFMA instructions in order to speed up big-integer multiplications. Although processors that support AVX512IFMA are not yet available at the time this paper is written, we show how currently available public tools can be used for estimating their potential performance benefits. For example, based on these tools, we expect a 2x speedup for 1024-bit integer multiplication, over the best currently available method.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116532956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaoping Cui, Weiqiang Liu, Dong Wenwen, F. Lombardi
A parallel decimal multiplier is proposed in this paper to improve performance by mainly exploiting the properties of three different binary coded decimal (BCD) codes, namely the redundant BCD excess-3 code (XS-3), the overloaded decimal digit set (ODDS) code and BCD-4221/5211 code, hence this design is referred to as hybrid. The signed-digit radix-10 recoding with the digit set {-5, 5} and the redundant BCD excess-3 (XS-3) representations are used for partial product (PP) generation. In this paper, a new decimal partial product reduction (PPR) tree is proposed, it consists of a binary PPR tree block, a nonfixed size BCD-4221 counter correction block and a BCD-4221/5211 decimal PPR tree block. Analysis and comparison using the logical effort model and 45 nm technology show that the proposed decimal multiplier is faster compared with previous designs found in the technical literature.
{"title":"A Parallel Decimal Multiplier Using Hybrid Binary Coded Decimal (BCD) Codes","authors":"Xiaoping Cui, Weiqiang Liu, Dong Wenwen, F. Lombardi","doi":"10.1109/ARITH.2016.8","DOIUrl":"https://doi.org/10.1109/ARITH.2016.8","url":null,"abstract":"A parallel decimal multiplier is proposed in this paper to improve performance by mainly exploiting the properties of three different binary coded decimal (BCD) codes, namely the redundant BCD excess-3 code (XS-3), the overloaded decimal digit set (ODDS) code and BCD-4221/5211 code, hence this design is referred to as hybrid. The signed-digit radix-10 recoding with the digit set {-5, 5} and the redundant BCD excess-3 (XS-3) representations are used for partial product (PP) generation. In this paper, a new decimal partial product reduction (PPR) tree is proposed, it consists of a binary PPR tree block, a nonfixed size BCD-4221 counter correction block and a BCD-4221/5211 decimal PPR tree block. Analysis and comparison using the logical effort model and 45 nm technology show that the proposed decimal multiplier is faster compared with previous designs found in the technical literature.","PeriodicalId":145448,"journal":{"name":"2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127974847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}