The performance of many cryptographic primitives relies on efficient algorithms and implementation techniques for arithmetic in binary fields. While dedicated hardware support for such arithmetic is an emerging trend, software-only implementation techniques remain important for legacy or non-equipped processors. One such technique is software-based bit-slicing. In the context of binary fields this is an interesting option: there is extensive previous work on bit-oriented designs for arithmetic in hardware, and such designs are intuitively well suited to bit-slicing in software. In this paper we harness that previous work to investigate bit-sliced, software-only implementations of binary field arithmetic over a range of practical field sizes, using a normal basis representation. We apply our results to demonstrate significant performance improvements for a stream cipher, and over the frequently employed Ning-Yin approach to normal basis implementation in software.
{"title":"Bit-Sliced Binary Normal Basis Multiplication","authors":"B. Brumley, D. Page","doi":"10.1109/ARITH.2011.36","DOIUrl":"https://doi.org/10.1109/ARITH.2011.36","url":null,"abstract":"The performance of many cryptographic primitives is reliant on efficient algorithms and implementation techniques for arithmetic in binary fields. While dedicated hardware support for said arithmetic is an emerging trend, the study of software-only implementation techniques remains important for legacy or non-equipped processors. One such technique is that of software-based bit-slicing. In the context of binary fields, this is an interesting option since there is extensive previous work on bit-oriented designs for arithmetic in hardware, such designs are intuitively well suited to bit-slicing in software. In this paper we harness previous work, using it to investigate bit-sliced, software-only implementation arithmetic for binary fields, over a range of practical field sizes and using a normal basis representation. We apply our results to demonstrate significant performance improvements for a stream cipher, and over the frequently employed Ning-Yin approach to normal basis implementation in software.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133670094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The hardware implementation of modular exponentiation for very large integers is a well-known topic in digital arithmetic. An effective approach for obtaining parallel and carry-free implementations consists in using the Montgomery exponentiation algorithm and executing the necessary operations in RNS. Two efficient methods for performing the RNS Montgomery exponentiation have been proposed by Kawamura et al. and by Bajard and Imbert; these approaches differ mainly in the algorithm used for implementing the base extension. This paper presents a modified RNS Montgomery exponentiation algorithm in which several multiplications are moved outside the main execution loop and replaced by an effective pre-processing stage, producing a significant saving in the overall delay with respect to state-of-the-art approaches. Since the proposed modification can be applied to both of the above algorithms, two versions are specifically discussed.
{"title":"A General Approach for Improving RNS Montgomery Exponentiation Using Pre-processing","authors":"F. Gandino, F. Lamberti, P. Montuschi, J. Bajard","doi":"10.1109/ARITH.2011.35","DOIUrl":"https://doi.org/10.1109/ARITH.2011.35","url":null,"abstract":"The hardware implementation of modular exponentiation for very large integers is a well-known topic in digital arithmetic. An effective approach for obtaining parallel and carry-free implementations consists in using the Montgomery exponentiation algorithm and executing the necessary operations in RNS. Two efficient methods for performing the RNS Montgomery exponentiation have been proposed by Kawamura et al. and by Bajard and Imbert. The above approaches mainly differ in the algorithm used for implementing the base extension. This paper presents a modified RNS Montgomery exponentiation algorithm, where several multiplications are moved outside the main execution loop and replaced by an effective pre-processing stage producing a significant saving on the overall delay with respect to state-of-the-art approaches. Since the proposed modification should be applied to both of the above algorithms, two versions are specifically discussed.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134339839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present an IEEE 754-2008 and ARM compliant floating-point microarchitecture that preserves the higher performance of separate multiply and add units while decreasing the effective latency of fused multiply-adds (FMAs). The multiplier supports subnormals in a novel and faster manner, shifting the partial products so that injection rounding can be used. The early-normalizing adder retains the low latency of a split-path near/far adder, but does so in a unified path with less area. The adder also allows rounding on effective subtractions involving one input that is twice the normal width, a necessary feature for handling FMAs. The resulting floating-point unit has about twice the instructions-per-cycle (IPC) performance of the best previous ARM design, and can be clocked at a higher speed despite the wider paths required by FMAs.
{"title":"Fused Multiply-Add Microarchitecture Comprising Separate Early-Normalizing Multiply and Add Pipelines","authors":"D. Lutz","doi":"10.1109/ARITH.2011.25","DOIUrl":"https://doi.org/10.1109/ARITH.2011.25","url":null,"abstract":"We present an IEEE 754-2008 and ARM compliant floating-point micro architecture that preserves the higher performance of separate multiply and add units while decreasing the effective latency of fused multiply-adds (FMAs). The multiplier supports subnormals in a novel and faster manner, shifting the partial products so that injection rounding can be used. The early-normalizing adder retains the low latency of a split path near/far adder, but does so in a unified path with less area. The adder also allows rounding on effective subtractions involving one input that is twice the normal width, a necessary feature for handling FMAs. The resulting floating-point unit has about twice the (IPC) performance of the best previous ARM design, and can be clocked at a higher speed despite the wider paths required by FMAs.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115768454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digit-by-rounding algorithms enable efficient hardware implementations of algebraic functions such as the reciprocal, square root, or reciprocal square root, but certifying the correctness of such algorithms is a nontrivial endeavor. Traditionally, sufficient conditions for correctness are derived as closed-form formulae relating key design parameters. These sufficient conditions, however, often prove stricter than necessary, excluding correct and efficient designs. In this paper, we present a rigorous, computer-aided method for correctness certification that better approximates the necessary conditions, lowering the risk of rejecting correct designs. We also present two specific applications of this method. First, when applied to a conventional digit-by-rounding reciprocal square root design, our method enabled a fourfold reduction in lookup table size relative to the minimum dictated by a standard sufficient condition. Second, our method certified the correctness of a novel reciprocal square root design that we developed to parallelize two computational steps whose sequential execution lies on the critical path of conventional designs. The difficulty in deriving closed-form sufficient conditions to ascertain this design's correctness provided the original motivation for development of the new certification method.
{"title":"Tight Certification Techniques for Digit-by-Rounding Algorithms with Application to a New 1/sqrt(x) Design","authors":"P. T. P. Tang, J. A. Butts, R. Dror, D. Shaw","doi":"10.1109/ARITH.2011.29","DOIUrl":"https://doi.org/10.1109/ARITH.2011.29","url":null,"abstract":"Digit-by-rounding algorithms enable efficient hardware implementations of algebraic functions such as the reciprocal, square root, or reciprocal square root, but certifying the correctness of such algorithms is a nontrivial endeavor. Traditionally, sufficient conditions for correctness are derived as closed-form formulae relating key design parameters. These sufficient conditions, however, often prove stricter than necessary, excluding correct and efficient designs. In this paper, we present a rigorous, computer-aided method for correctness certification that better approximates the necessary conditions, lowering the risk of rejecting correct designs. We also present two specific applications of this method. First, when applied to a conventional digit-by-rounding reciprocal square root design, our method enabled a fourfold reduction in lookup table size relative to the minimum dictated by a standard sufficient condition. Second, our method certified the correctness of a novel reciprocal square root design that we developed to parallelize two computational steps whose sequential execution lies on the critical path of conventional designs. The difficulty in deriving closed-form sufficient conditions to ascertain this design's correctness provided the original motivation for development of the new certification method.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Signed digit (SD) number systems allow for high-performance carry-free adders. Maximally redundant SD (MRSD) alternatives provide maximal encoding efficiency among radix-2^h SD number systems, where the value of h tunes the area-time trade-off. A straightforward implementation of the conventional carry-free addition algorithm requires three O(log h) addition-like operations in sequence. However, there are several MRSD implementations with only one such operation. Some of them are delay optimized but suffer from extensive hardware redundancy, while other, equally fast adders show lower power/area consumption. A careful study of the latter cases hints at a variety of improvement options; building on these and on a new transfer computation technique, we develop a family of faster MRSD adders that consume less power/area than all previous relevant works. They also fit efficiently within the redundant-digit floating-point addition scheme. However, like their ancestor designs, they suffer from an inherent difficulty of MRSD adders: handling hidden leading zero-digits. To remedy this problem, we use less redundant SD representations, where our transfer extraction method applies efficiently and leads to far less complex leading zero-digit detection. All the presented designs are supported by exhaustive correctness tests and by performance evaluation via synthesis for a 0.13 micrometer CMOS technology.
{"title":"A Family of High Radix Signed Digit Adders","authors":"S. Gorgin, G. Jaberipur","doi":"10.1109/ARITH.2011.24","DOIUrl":"https://doi.org/10.1109/ARITH.2011.24","url":null,"abstract":"Signed digit (SD) number systems allow for high performance carry-free adders. Maximally redundant SD (MRSD) alternatives provide maximal encoding efficiency among Radix-2^h SD number systems, whereby value of h tunes the area-time trade-off. Straightforward implementation of the conventional carry-free addition algorithm requires three O(log h) addition-like operations in sequence. However, there are several MRSD implementations with only one such operation. Some of them are delay optimized, but suffer from extensive hardware redundancy, while some other equally fast adders show less power/area consumption. A careful study of the latter cases hints on variety of improvement options, based on which and a new transfer computation technique, we develop a family of faster MRSD adders that consume less power/area than all the previous relevant works. They also fit efficiently within the redundant digit floating point addition scheme. However, similar to their relevant ancestor designs, suffer from an inherent property of MRSD adders, i.e., difficulty of handling hidden leading zero-digits. To remedy this problem, we use less redundant SD representations, where our transfer extraction method applies efficiently and leads to far less complex leading zero-digit detection. All the presented designs are supported by exhaustive correctness tests and performance evaluation via 0.13 micrometer CMOS technology synthesis.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131425695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper discusses High Performance Computing, including the introduction of the fused multiply-add dataflow and innovations in vector computing and multiprocessing. These developments have ushered in a new era of high performance computing that brings aspects of human intelligence to computers.
{"title":"High Intelligence Computing: The New Era of High Performance Computing","authors":"Ralf Fischer","doi":"10.1109/ARITH.2011.42","DOIUrl":"https://doi.org/10.1109/ARITH.2011.42","url":null,"abstract":"This paper discusses about High Performance Computing including the introduction of the fused multiply-add dataflow, and innovations in vector computing and multi processing. This has led to a new era in high performance that has created human intelligence in computers.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132101159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Define an "augmented precision" algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(-2p). Assuming an FMA instruction is available, we perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2D-norm sqrt(x^2+y^2). Then we give tight lower bounds on the minimum distance (in ulps) between sqrt(x^2+y^2) and a midpoint when sqrt(x^2+y^2) is not itself a midpoint. This allows us to determine cases when our algorithms make it possible to return correctly-rounded 2D-norms.
{"title":"Augmented Precision Square Roots and 2-D Norms, and Discussion on Correctly Rounding sqrt(x^2+y^2)","authors":"N. Brisebarre, Mioara Joldes, Peter Kornerup, Érik Martin-Dorel, J. Muller","doi":"10.1109/ARITH.2011.13","DOIUrl":"https://doi.org/10.1109/ARITH.2011.13","url":null,"abstract":"Define an \"augmented precision\" algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(-2p). Assuming an FMA instruction is available, we perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2D-norm sqrt(x^2+y^2). Then we give tight lower bounds on the minimum distance (in ulps) between sqrt(x^2+y^2) and a midpoint when sqrt(x^2+y^2) is not itself a midpoint. This allows us to determine cases when our algorithms make it possible to return correctly-rounded 2D-norms.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134250296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The logarithmic number system has been proposed as an alternative to floating-point arithmetic. Multiplication, division and square-root operations are accomplished with fixed-point methods, but addition and subtraction are considerably more challenging. Recent work has demonstrated that these operations too can be done with similar speed and accuracy to their FP equivalents, but the necessary circuitry is complex. In particular, it is dominated by the need for large ROM tables for the storage of non-linear functions. This paper describes two algorithms, a new co-transformation procedure and an improvement to an existing interpolation method, that reduce these tables to an extent that allows their easy synthesis in logic. An implementation shows substantial reductions in area and delay from the previous best 32-bit realisation, with equivalent accuracy.
{"title":"ROM-less LNS","authors":"Rizalafande Che Ismail, J. N. Coleman","doi":"10.1109/ARITH.2011.15","DOIUrl":"https://doi.org/10.1109/ARITH.2011.15","url":null,"abstract":"The logarithmic number system has been proposed as an alternative to floating-point arithmetic. Multiplication, division and square-root operations are accomplished with fixed-point methods, but addition and subtraction are considerably more challenging. Recent work has demonstrated that these operations too can be done with similar speed and accuracy to their FP equivalents, but the necessary circuitry is complex. In particular, it is dominated by the need for large ROM tables for the storage of non-linear functions. This paper describes two algorithms, a new co-transformation procedure and an improvement to an existing interpolation method, that reduce these tables to an extent that allows their easy synthesis in logic. An implementation shows substantial reductions in area and delay from the previous best 32-bit realisation, with equivalent accuracy.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128713033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a number of new high-radix ripple-carry adder designs based on Ling's addition technique and a recently published expansion thereof. The proposed adders all have one inverting CMOS cell per stage along the carry-in to carry-out critical path and, at 16-b word lengths, the fastest of them matches the speed of a 16-b prefix adder for only 63% of the area. These adders will be of use in VLSI circuits implementing modern wireless DSP algorithms and in floating-point unit exponent logic, both of which typically use short word length arithmetic.
{"title":"Fast Ripple-Carry Adders in Standard-Cell CMOS VLSI","authors":"N. Burgess","doi":"10.1109/ARITH.2011.23","DOIUrl":"https://doi.org/10.1109/ARITH.2011.23","url":null,"abstract":"This paper presents a number of new high-radix ripple-carry adder designs based on Ling's addition technique and a recently-published expansion thereof. The proposed adders all have one inverting CMOS cell per stage along the carry-in to carry-out critical path and, at 16-b word lengths, the fastest of them matches the speed of a 16-b prefix adder for only 63% of the area. These adders will be of use in VLSI circuits implementing modern wireless DSP algorithms and in Floating-Point Unit exponent logic, both of which typically use short word length arithmetic.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"214 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123693534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
{"title":"How to Square Floats Accurately and Efficiently on the ST231 Integer Processor","authors":"C. Jeannerod, Jingyan Jourdan-Lu, Christophe Monat, G. Revy","doi":"10.1109/ARITH.2011.19","DOIUrl":"https://doi.org/10.1109/ARITH.2011.19","url":null,"abstract":"We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127875096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}