A hardware algorithm for integer division is proposed. It is based on the digit-recurrence, non-restoring division algorithm. Fast computation is achieved by the use of the radix-2 signed-digit representation. The algorithm does not require normalization of the divisor and hence requires neither area-consuming leading-one (or leading-zero) detection nor variable-amount shifts. A combinational (unfolded) implementation of the algorithm yields a regularly structured array divider, which can be pipelined to increase throughput. A sequential implementation yields a compact divider.
{"title":"A hardware algorithm for integer division","authors":"N. Takagi, Shunsuke Kadowaki, K. Takagi","doi":"10.1109/ARITH.2005.6","DOIUrl":"https://doi.org/10.1109/ARITH.2005.6","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
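As a rough illustration of the non-restoring recurrence the abstract above builds on, here is a minimal Python sketch of textbook radix-2 non-restoring division on unsigned integers. It is not the proposed algorithm: it keeps the partial remainder as an ordinary integer rather than in signed-digit form, and the function name and the bit-width parameter `n` are assumptions made for this example.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Textbook radix-2 non-restoring division of an n-bit unsigned dividend.

    Returns (quotient, remainder). Illustrative only: the partial remainder
    is a plain Python int, not a signed-digit number.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    r = 0  # partial remainder, kept in the range [-divisor, divisor)
    q = 0  # quotient bits collected most significant first
    for i in range(n - 1, -1, -1):
        bit = (dividend >> i) & 1
        # Never restore: subtract the divisor after a non-negative
        # remainder, add it back after a negative one.
        if r >= 0:
            r = 2 * r + bit - divisor
        else:
            r = 2 * r + bit + divisor
        q = (q << 1) | (1 if r >= 0 else 0)
    # Single correction step at the end if the remainder is negative.
    if r < 0:
        r += divisor
    return q, r
```

For example, `nonrestoring_divide(7, 2, 3)` goes through the partial remainders -1, 1, 1 and returns `(3, 1)`.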
Hash functions are an important cryptographic primitive. They are used to obtain a fixed-size fingerprint, or hash value, of an arbitrarily long message. We focus particularly on the class of dedicated hash functions, whose general construction is presented; the peculiar arrangement of sequential and combinational units makes applying pipelining techniques to these constructions nontrivial. We formalize an optimization technique called quasi-pipelining, whose goal is to shorten the critical path and thus increase the clock frequency in dedicated hardware implementations. The SHA-2 algorithm has previously been examined by Dadda et al. with specific versions of quasi-pipelining; a full generalization of the technique is presented, along with its application to the SHA-1 algorithm. Quasi-pipelining could equally be applied to future hashing algorithms, provided they are designed along the same lines as those of the SHA family.
{"title":"Quasi-pipelined hash circuits","authors":"Marco Macchetti, L. Dadda","doi":"10.1109/ARITH.2005.36","DOIUrl":"https://doi.org/10.1109/ARITH.2005.36","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
Saturating counters are a newly defined class of generalized parallel counters that provide the exact number of inputs which are equal to 1 only if this number is below a given threshold. Such counters are useful in, for example, self-test and repair units for embedded memories. This paper defines saturating counters for arbitrary threshold values and presents several alternatives for their implementation. The delay and area of the proposed design alternatives are then estimated using a 0.25 µm cell library. Finally, we study the behavior of saturating counters when the threshold approaches the number of input bits, i.e., the special case of non-saturating parallel counters.
{"title":"Synthesis of saturating counters using traditional and non-traditional basic counters","authors":"Zhaojun Wo, I. Koren","doi":"10.1109/ARITH.2005.42","DOIUrl":"https://doi.org/10.1109/ARITH.2005.42","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
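The behavior described in the abstract above can be modeled in a few lines of Python. This is only a functional sketch, not one of the paper's hardware designs; the function name is hypothetical, and reporting the threshold itself once the count reaches it is an assumption (the abstract only says the count is exact below the threshold).

```python
def saturating_count(bits, threshold):
    """Count the inputs equal to 1, saturating at `threshold`.

    The result is the exact count while it stays below `threshold`;
    otherwise `threshold` is returned (assumed saturation value).
    """
    count = 0
    for b in bits:
        if b and count < threshold:
            count += 1
    return count
```

With a threshold equal to the number of inputs, the function never saturates and reduces to an ordinary parallel counter, which is the special case the abstract mentions.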
The fused multiply-accumulate instruction (fused-mac), available on some current processors such as the PowerPC or the Itanium, eases some calculations. We give examples of floating-point functions (such as ulp(x) or Nextafter(x, y)) and of useful tests that are easily computable using a fused-mac. We then show that, with rounding to nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers, and we give an algorithm that computes that error.
{"title":"Some functions computable with a fused-mac","authors":"S. Boldo, J. Muller","doi":"10.1109/ARITH.2005.39","DOIUrl":"https://doi.org/10.1109/ARITH.2005.39","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
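The claim in the abstract above, that the rounding error of a fused-mac is a sum of two floating-point numbers, can be checked on concrete inputs with exact rational arithmetic. The Python sketch below is not the paper's algorithm: it emulates a correctly rounded `a*b + c` with `fractions.Fraction` instead of using a hardware instruction, and the function name is hypothetical. It assumes binary64 floats, round-to-nearest, and no overflow or underflow.

```python
from fractions import Fraction

def fma_error_terms(a, b, c):
    """Return (r, u1, u2, err) for the operation a*b + c.

    r   : a*b + c rounded once to the nearest double (emulated fused-mac)
    err : the exact rounding error (a*b + c) - r, as a Fraction
    u1  : err rounded to the nearest double
    u2  : what is left of err after removing u1, rounded to a double
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    r = float(exact)                 # Fraction -> float rounds to nearest
    err = exact - Fraction(r)
    u1 = float(err)
    u2 = float(err - Fraction(u1))
    return r, u1, u2, err

r, u1, u2, err = fma_error_terms(1 + 2.0**-30, 1 + 2.0**-29, 2.0**-120)
# The error needs two doubles here, and two are enough.
print(Fraction(u1) + Fraction(u2) == err)
```

In this example the error is 2^-59 + 2^-120, which does not fit in a single double but is exactly the sum `u1 + u2`.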
The floating-point unit (FPU) in the synergistic processor element of the first-generation multi-core CELL processor is described. The FPU supports 4-way SIMD single-precision and integer operations and 2-way SIMD double-precision operations. The design required high frequency, low latency, and power and area efficiency, with primary application to multimedia streaming workloads such as 3D graphics. The FPU has three different latencies, optimized for the performance-critical single-precision FMA operations, which execute with a 6-cycle latency at an 11 FO4 cycle time. The latency includes the global forwarding of the result. These challenging performance, power, and area goals were achieved through the co-design of architecture and implementation, with optimizations at all levels of the design. This paper focuses on the logical and algorithmic aspects of the FPU that we developed to achieve these goals.
{"title":"The vector floating-point unit in a synergistic processor element of a CELL processor","authors":"S. M. Müller, C. Jacobi, H. Oh, K. Tran, S. Cottier, B. Michael, H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano, T. Machida, S. Dhong","doi":"10.1109/ARITH.2005.45","DOIUrl":"https://doi.org/10.1109/ARITH.2005.45","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
The reciprocal and square-root reciprocal operations are important in several applications. For these operations, we present algorithms that combine a digit-by-digit module and one iteration of a quadratic-convergence approximation. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this way, both parts execute in an overlapped manner, so that the total number of cycles is about half of the number that would be required by the digit-by-digit part alone. Because of the approximation, correct rounding of the result cannot be obtained directly in all cases; we propose a variable-time implementation that produces the correctly rounded result with a small average overhead. Radix-4 implementations are described and have been synthesized. They achieve the same cycle time as the standard digit-by-digit implementation, resulting in a speed-up of about 2 and, because of the approximation part, the area factor is also about 2. We also show a combined implementation for both operations that has essentially the same complexity as that for square-root reciprocal alone.
{"title":"Low latency digit-recurrence reciprocal and square-root reciprocal algorithm and architecture","authors":"E. Antelo, T. Lang, P. Montuschi, A. Nannarelli","doi":"10.1109/ARITH.2005.29","DOIUrl":"https://doi.org/10.1109/ARITH.2005.29","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
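To see why a single quadratic-convergence step can roughly halve the work of a digit-by-digit reciprocal, consider the Newton-Raphson step x' = x(2 - d·x). The abstract above does not name the approximation it uses, so Newton-Raphson is an assumption chosen here as the standard example, and the function names are hypothetical. If x has relative error e (that is, d·x = 1 - e), then d·x' = 1 - e², so the number of correct digits doubles. The Python sketch below checks this with exact rationals.

```python
from fractions import Fraction

def newton_reciprocal_step(x, d):
    """One Newton-Raphson step toward 1/d, starting from the estimate x."""
    return x * (2 - d * x)

def relative_error(x, d):
    """Relative error e of x as an estimate of 1/d, defined by d*x = 1 - e."""
    return 1 - d * x

d = Fraction(3, 2)
x0 = Fraction(5, 8)                   # coarse estimate of 1/d = 2/3
x1 = newton_reciprocal_step(x0, d)
print(relative_error(x0, d), relative_error(x1, d))  # prints: 1/16 1/256
```

The relative error goes from 1/16 to (1/16)² = 1/256, i.e. from 4 to 8 correct bits in one step.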
We propose the first general multiplication algorithm in $GF(2^k)$ with a subquadratic area complexity of $O(k^{8/5}) = O(k^{1.6})$. Using the Chinese remainder theorem, we represent the elements of $GF(2^k)$, i.e., the polynomials in $GF(2)[X]$ of degree at most $k-1$, by their remainders modulo a set of $n$ pairwise prime trinomials, $T_1, \ldots, T_n$, of degree $d$ and such that $nd \geq k$. Our algorithm is based on Montgomery's multiplication applied to the ring formed by the direct product of the trinomials.
{"title":"Parallel Montgomery multiplication in GF(2^k) using trinomial residue arithmetic","authors":"J. Bajard, L. Imbert, G. Jullien","doi":"10.1109/ARITH.2005.34","DOIUrl":"https://doi.org/10.1109/ARITH.2005.34","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
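The residue representation in the abstract above can be illustrated with a small Python sketch. It shows only the underlying idea, that a product of polynomials over GF(2) can be tracked independently modulo each trinomial; it does not implement the Montgomery multiplication or the reconstruction step. Polynomials are encoded as integers (bit i is the coefficient of X^i), and the helper names and the two degree-3 trinomials are assumptions chosen for the example.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[X] polynomials encoded as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def polymod(a, m):
    """Remainder of a modulo the nonzero polynomial m over GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a

# Two coprime degree-3 trinomials: X^3 + X + 1 and X^3 + X^2 + 1.
TRINOMIALS = [0b1011, 0b1101]

def to_residues(a):
    """Residue representation of a: one remainder per trinomial."""
    return [polymod(a, t) for t in TRINOMIALS]

def residue_multiply(ra, rb):
    """Multiply two residue vectors channel by channel."""
    return [polymod(clmul(x, y), t) for x, y, t in zip(ra, rb, TRINOMIALS)]
```

Each channel works on polynomials of degree below d = 3 regardless of k, which is what makes the channels small and independent of one another.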
We present a radix-2 online computational scheme for evaluating multinomials in a fixed-point number representation system. Its main advantage is that it can adapt to any evaluation graph representing the multinomial. Evaluation graphs are efficient representations of multinomials in a factored form. The proposed scheme maps subgraphs of the evaluation graph using linear-system operators. These operators transform the expressions represented by the subgraphs into systems of linear equations. The linear equations are then solved in an online, most-significant-digit-first fashion. The scheme produces, after an initial delay, one output digit per iteration for inputs within range. The iteration time is equal to the sum of the delays of a redundant adder, multiplexer, register and a selection unit and is independent of the size of the multinomial and the precision of the inputs/outputs. The initial delay is proportional to the diameter of the evaluation graph and the maximum number of children of any addition node in the graph. The proposed method lends itself to implementation using simple, highly regular hardware with serial interconnections between modules.
{"title":"A linear-system operator based scheme for evaluation of multinomials","authors":"P. Adharapurapu, M. Ercegovac","doi":"10.1109/ARITH.2005.8","DOIUrl":"https://doi.org/10.1109/ARITH.2005.8","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
We develop the foundations for confirming monotonicity of a multi-term reciprocal function approximation. We introduce the concept of operand recoding to improve the accuracy of multipartite approximation. The results are applied to a proposed four-partite reciprocal implementation with a total table size of about 27 Kbytes, which yields a one-ulp monotonic reciprocal instruction for an IEEE standard single-precision-sized (24-bit) format.
{"title":"Single precision reciprocals by multipartite table lookup","authors":"Peter Kornerup, D. Matula","doi":"10.1109/ARITH.2005.37","DOIUrl":"https://doi.org/10.1109/ARITH.2005.37","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
This paper presents a technique for designing linear and quadratic interpolators for function approximation using truncated multipliers and squarers. Initial coefficient values are found using a Chebyshev series approximation and then adjusted through exhaustive simulation to minimize the maximum absolute error of the interpolator output. This technique is suitable for any function and any precision up to 24 bits (IEEE single precision). Designs for linear and quadratic interpolators that implement the reciprocal function, f(x) = 1/x, are presented and analyzed as an example. We show that a 24-bit truncated reciprocal quadratic interpolator with a design specification of ±1 ulp error requires 24.1% fewer partial products to implement than a comparable standard interpolator with the same error specification.
{"title":"Efficient function approximation using truncated multipliers and squarers","authors":"E. G. Walters, M. Schulte","doi":"10.1109/ARITH.2005.18","DOIUrl":"https://doi.org/10.1109/ARITH.2005.18","journal":{"name":"17th IEEE Symposium on Computer Arithmetic (ARITH'05)"},"publicationDate":"2005-06-27"}
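As background for the abstract above, a truncated multiplier saves hardware by not forming the partial-product bits in the least significant columns. The Python sketch below models that idea bit by bit. It is a simplified model under stated assumptions: the function name and the parameters `n` (operand width) and `k` (number of dropped columns) are hypothetical, and no correction constant or rounding is applied, although practical designs usually add one.

```python
def truncated_multiply(a, b, n, k):
    """Multiply two n-bit unsigned ints, dropping the k lowest columns.

    Only partial-product bits a_i * b_j with i + j >= k are summed, so the
    result never exceeds the true product a * b.
    """
    total = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            if (b >> j) & 1 and i + j >= k:
                total += 1 << (i + j)
    return total
```

For 13 × 11 with `n = 4`, dropping `k = 3` columns discards three partial-product bits worth 1 + 2 + 4 = 7 and gives 136 instead of 143. For `k <= n` the truncation error is at most the sum of (c + 1)·2^c over the dropped columns c = 0, ..., k - 1, which is the kind of bound a designer trades against the saved partial products.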