The evaluation of special functions often involves the evaluation of numerical constants. When the precision of the evaluation is known in advance (e.g., when developing libms) these constants are simply precomputed once and for all. In contrast, when the precision is dynamically chosen by the user (e.g., in multiple precision libraries), the constants must be evaluated on the fly at the required precision and with a rigorous error bound. Often, such constants are easily obtained by means of formulas involving simple numbers and functions. In principle, it is not a difficult work to write multiple precision code for evaluating such formulas with a rigorous round off analysis: one only has to study how round off errors propagate through sub expressions. However, this work is painful and error-prone and it is difficult for a human being to be perfectly rigorous in this process. Moreover, the task quickly becomes impractical when the size of the formula grows. In this article, we present an algorithm that takes as input a constant formula and that automatically produces code for evaluating it in arbitrary precision with a rigorous error bound. It has been implemented in the Solly a free software tool and its behavior is illustrated on several examples.
{"title":"Automatic Generation of Code for the Evaluation of Constant Expressions at Any Precision with a Guaranteed Error Bound","authors":"S. Chevillard","doi":"10.1109/ARITH.2011.38","DOIUrl":"https://doi.org/10.1109/ARITH.2011.38","url":null,"abstract":"The evaluation of special functions often involves the evaluation of numerical constants. When the precision of the evaluation is known in advance (e.g., when developing libms) these constants are simply precomputed once and for all. In contrast, when the precision is dynamically chosen by the user (e.g., in multiple precision libraries), the constants must be evaluated on the fly at the required precision and with a rigorous error bound. Often, such constants are easily obtained by means of formulas involving simple numbers and functions. In principle, it is not a difficult work to write multiple precision code for evaluating such formulas with a rigorous round off analysis: one only has to study how round off errors propagate through sub expressions. However, this work is painful and error-prone and it is difficult for a human being to be perfectly rigorous in this process. Moreover, the task quickly becomes impractical when the size of the formula grows. In this article, we present an algorithm that takes as input a constant formula and that automatically produces code for evaluating it in arbitrary precision with a rigorous error bound. It has been implemented in the Solly a free software tool and its behavior is illustrated on several examples.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130159002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field Programmable Gate Arrays (FPGAs) are conventionally considered as 'glue-logic'. However, modern FPGAs are extremely competitive compared to state-of-the-art CPUs for commercial HPC workloads, such as those found in Oil and Gas and Finance. For example, an FPGA accelerated system can be 31-37 times faster than an equivalently sized conventional machine, and consume 1/39 of the power. The key to achieving the best performance in FPGA accelerators, while maintaining correctness, is optimization of arithmetic units and data types to suit the range/precision at each point in the computation. The flexibility of the FPGA to implement non-standard arithmetic, combined with a data-flow programming model that instantiates a separate unit for each arithmetic operator in the code provides a wide design space. As such, FPGA computing offers significant opportunity for arithmetic research into 'large scale' HPC applications, where there is an opportunity to move away from standard IEEE formats, either to improve precision compared to the CPU version or to increase speed.
{"title":"Accelerating Large-Scale HPC Applications Using FPGAs","authors":"R. Dimond, S. Racanière, O. Pell","doi":"10.1109/ARITH.2011.34","DOIUrl":"https://doi.org/10.1109/ARITH.2011.34","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) are conventionally considered as 'glue-logic'. However, modern FPGAs are extremely competitive compared to state-of-the-art CPUs for commercial HPC workloads, such as those found in Oil and Gas and Finance. For example, an FPGA accelerated system can be 31-37 times faster than an equivalently sized conventional machine, and consume 1/39 of the power. The key to achieving the best performance in FPGA accelerators, while maintaining correctness, is optimization of arithmetic units and data types to suit the range/precision at each point in the computation. The flexibility of the FPGA to implement non-standard arithmetic, combined with a data-flow programming model that instantiates a separate unit for each arithmetic operator in the code provides a wide design space. As such, FPGA computing offers significant opportunity for arithmetic research into 'large scale' HPC applications, where there is an opportunity to move away from standard IEEE formats, either to improve precision compared to the CPU version or to increase speed.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The logarithmic number system has been proposed as an alternative to floating-point arithmetic. Multiplication, division and square-root operations are accomplished with fixed-point methods, but addition and subtraction are considerably more challenging. Recent work has demonstrated that these operations too can be done with similar speed and accuracy to their FP equivalents, but the necessary circuitry is complex. In particular, it is dominated by the need for large ROM tables for the storage of non-linear functions. This paper describes two algorithms, a new co-transformation procedure and an improvement to an existing interpolation method, that reduce these tables to an extent that allows their easy synthesis in logic. An implementation shows substantial reductions in area and delay from the previous best 32-bit realisation, with equivalent accuracy.
{"title":"ROM-less LNS","authors":"Rizalafande Che Ismail, J. N. Coleman","doi":"10.1109/ARITH.2011.15","DOIUrl":"https://doi.org/10.1109/ARITH.2011.15","url":null,"abstract":"The logarithmic number system has been proposed as an alternative to floating-point arithmetic. Multiplication, division and square-root operations are accomplished with fixed-point methods, but addition and subtraction are considerably more challenging. Recent work has demonstrated that these operations too can be done with similar speed and accuracy to their FP equivalents, but the necessary circuitry is complex. In particular, it is dominated by the need for large ROM tables for the storage of non-linear functions. This paper describes two algorithms, a new co-transformation procedure and an improvement to an existing interpolation method, that reduce these tables to an extent that allows their easy synthesis in logic. An implementation shows substantial reductions in area and delay from the previous best 32-bit realisation, with equivalent accuracy.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128713033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digit-by-rounding algorithms enable efficient hardware implementations of algebraic functions such as the reciprocal, square root, or reciprocal square root, but certifying the correctness of such algorithms is a nontrivial endeavor. Traditionally, sufficient conditions for correctness are derived as closed-form formulae relating key design parameters. These sufficient conditions, however, often prove stricter than necessary, excluding correct and efficient designs. In this paper, we present a rigorous, computer-aided method for correctness certification that better approximates the necessary conditions, lowering the risk of rejecting correct designs. We also present two specific applications of this method. First, when applied to a conventional digit-by-rounding reciprocal square root design, our method enabled a fourfold reduction in lookup table size relative to the minimum dictated by a standard sufficient condition. Second, our method certified the correctness of a novel reciprocal square root design that we developed to parallelize two computational steps whose sequential execution lies on the critical path of conventional designs. The difficulty in deriving closed-form sufficient conditions to ascertain this design's correctness provided the original motivation for development of the new certification method.
{"title":"Tight Certification Techniques for Digit-by-Rounding Algorithms with Application to a New 1/sqrt(x) Design","authors":"P. T. P. Tang, J. A. Butts, R. Dror, D. Shaw","doi":"10.1109/ARITH.2011.29","DOIUrl":"https://doi.org/10.1109/ARITH.2011.29","url":null,"abstract":"Digit-by-rounding algorithms enable efficient hardware implementations of algebraic functions such as the reciprocal, square root, or reciprocal square root, but certifying the correctness of such algorithms is a nontrivial endeavor. Traditionally, sufficient conditions for correctness are derived as closed-form formulae relating key design parameters. These sufficient conditions, however, often prove stricter than necessary, excluding correct and efficient designs. In this paper, we present a rigorous, computer-aided method for correctness certification that better approximates the necessary conditions, lowering the risk of rejecting correct designs. We also present two specific applications of this method. First, when applied to a conventional digit-by-rounding reciprocal square root design, our method enabled a fourfold reduction in lookup table size relative to the minimum dictated by a standard sufficient condition. Second, our method certified the correctness of a novel reciprocal square root design that we developed to parallelize two computational steps whose sequential execution lies on the critical path of conventional designs. The difficulty in deriving closed-form sufficient conditions to ascertain this design's correctness provided the original motivation for development of the new certification method.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present an IEEE 754-2008 and ARM compliant floating-point micro architecture that preserves the higher performance of separate multiply and add units while decreasing the effective latency of fused multiply-adds (FMAs). The multiplier supports subnormals in a novel and faster manner, shifting the partial products so that injection rounding can be used. The early-normalizing adder retains the low latency of a split path near/far adder, but does so in a unified path with less area. The adder also allows rounding on effective subtractions involving one input that is twice the normal width, a necessary feature for handling FMAs. The resulting floating-point unit has about twice the (IPC) performance of the best previous ARM design, and can be clocked at a higher speed despite the wider paths required by FMAs.
{"title":"Fused Multiply-Add Microarchitecture Comprising Separate Early-Normalizing Multiply and Add Pipelines","authors":"D. Lutz","doi":"10.1109/ARITH.2011.25","DOIUrl":"https://doi.org/10.1109/ARITH.2011.25","url":null,"abstract":"We present an IEEE 754-2008 and ARM compliant floating-point micro architecture that preserves the higher performance of separate multiply and add units while decreasing the effective latency of fused multiply-adds (FMAs). The multiplier supports subnormals in a novel and faster manner, shifting the partial products so that injection rounding can be used. The early-normalizing adder retains the low latency of a split path near/far adder, but does so in a unified path with less area. The adder also allows rounding on effective subtractions involving one input that is twice the normal width, a necessary feature for handling FMAs. The resulting floating-point unit has about twice the (IPC) performance of the best previous ARM design, and can be clocked at a higher speed despite the wider paths required by FMAs.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115768454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Signed digit (SD) number systems allow for high performance carry-free adders. Maximally redundant SD (MRSD) alternatives provide maximal encoding efficiency among Radix-2^h SD number systems, whereby value of h tunes the area-time trade-off. Straightforward implementation of the conventional carry-free addition algorithm requires three O(log h) addition-like operations in sequence. However, there are several MRSD implementations with only one such operation. Some of them are delay optimized, but suffer from extensive hardware redundancy, while some other equally fast adders show less power/area consumption. A careful study of the latter cases hints on variety of improvement options, based on which and a new transfer computation technique, we develop a family of faster MRSD adders that consume less power/area than all the previous relevant works. They also fit efficiently within the redundant digit floating point addition scheme. However, similar to their relevant ancestor designs, suffer from an inherent property of MRSD adders, i.e., difficulty of handling hidden leading zero-digits. To remedy this problem, we use less redundant SD representations, where our transfer extraction method applies efficiently and leads to far less complex leading zero-digit detection. All the presented designs are supported by exhaustive correctness tests and performance evaluation via 0.13 micrometer CMOS technology synthesis.
{"title":"A Family of High Radix Signed Digit Adders","authors":"S. Gorgin, G. Jaberipur","doi":"10.1109/ARITH.2011.24","DOIUrl":"https://doi.org/10.1109/ARITH.2011.24","url":null,"abstract":"Signed digit (SD) number systems allow for high performance carry-free adders. Maximally redundant SD (MRSD) alternatives provide maximal encoding efficiency among Radix-2^h SD number systems, whereby value of h tunes the area-time trade-off. Straightforward implementation of the conventional carry-free addition algorithm requires three O(log h) addition-like operations in sequence. However, there are several MRSD implementations with only one such operation. Some of them are delay optimized, but suffer from extensive hardware redundancy, while some other equally fast adders show less power/area consumption. A careful study of the latter cases hints on variety of improvement options, based on which and a new transfer computation technique, we develop a family of faster MRSD adders that consume less power/area than all the previous relevant works. They also fit efficiently within the redundant digit floating point addition scheme. However, similar to their relevant ancestor designs, suffer from an inherent property of MRSD adders, i.e., difficulty of handling hidden leading zero-digits. To remedy this problem, we use less redundant SD representations, where our transfer extraction method applies efficiently and leads to far less complex leading zero-digit detection. All the presented designs are supported by exhaustive correctness tests and performance evaluation via 0.13 micrometer CMOS technology synthesis.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131425695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper discusses about High Performance Computing including the introduction of the fused multiply-add dataflow, and innovations in vector computing and multi processing. This has led to a new era in high performance that has created human intelligence in computers.
{"title":"High Intelligence Computing: The New Era of High Performance Computing","authors":"Ralf Fischer","doi":"10.1109/ARITH.2011.42","DOIUrl":"https://doi.org/10.1109/ARITH.2011.42","url":null,"abstract":"This paper discusses about High Performance Computing including the introduction of the fused multiply-add dataflow, and innovations in vector computing and multi processing. This has led to a new era in high performance that has created human intelligence in computers.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132101159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Brisebarre, Mioara Joldes, Peter Kornerup, Érik Martin-Dorel, J. Muller
Define an "augmented precision" algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(-2p). Assuming an FMA instruction is available, we perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2D-norm sqrt(x^2+y^2). Then we give tight lower bounds on the minimum distance (in ulps) between sqrt(x^2+y^2) and a midpoint when sqrt(x^2+y^2) is not itself a midpoint. This allows us to determine cases when our algorithms make it possible to return correctly-rounded 2D-norms.
{"title":"Augmented Precision Square Roots and 2-D Norms, and Discussion on Correctly Rounding sqrt(x^2+y^2)","authors":"N. Brisebarre, Mioara Joldes, Peter Kornerup, Érik Martin-Dorel, J. Muller","doi":"10.1109/ARITH.2011.13","DOIUrl":"https://doi.org/10.1109/ARITH.2011.13","url":null,"abstract":"Define an \"augmented precision\" algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(-2p). Assuming an FMA instruction is available, we perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2D-norm sqrt(x^2+y^2). Then we give tight lower bounds on the minimum distance (in ulps) between sqrt(x^2+y^2) and a midpoint when sqrt(x^2+y^2) is not itself a midpoint. This allows us to determine cases when our algorithms make it possible to return correctly-rounded 2D-norms.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134250296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a number of new high-radix ripple-carry adder designs based on Ling's addition technique and a recently-published expansion thereof. The proposed adders all have one inverting CMOS cell per stage along the carry-in to carry-out critical path and, at 16-b word lengths, the fastest of them matches the speed of a 16-b prefix adder for only 63% of the area. These adders will be of use in VLSI circuits implementing modern wireless DSP algorithms and in Floating-Point Unit exponent logic, both of which typically use short word length arithmetic.
{"title":"Fast Ripple-Carry Adders in Standard-Cell CMOS VLSI","authors":"N. Burgess","doi":"10.1109/ARITH.2011.23","DOIUrl":"https://doi.org/10.1109/ARITH.2011.23","url":null,"abstract":"This paper presents a number of new high-radix ripple-carry adder designs based on Ling's addition technique and a recently-published expansion thereof. The proposed adders all have one inverting CMOS cell per stage along the carry-in to carry-out critical path and, at 16-b word lengths, the fastest of them matches the speed of a 16-b prefix adder for only 63% of the area. These adders will be of use in VLSI circuits implementing modern wireless DSP algorithms and in Floating-Point Unit exponent logic, both of which typically use short word length arithmetic.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"214 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123693534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Jeannerod, Jingyan Jourdan-Lu, Christophe Monat, G. Revy
We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
{"title":"How to Square Floats Accurately and Efficiently on the ST231 Integer Processor","authors":"C. Jeannerod, Jingyan Jourdan-Lu, Christophe Monat, G. Revy","doi":"10.1109/ARITH.2011.19","DOIUrl":"https://doi.org/10.1109/ARITH.2011.19","url":null,"abstract":"We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127875096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}