Exact rounding of certain elementary functions
M. Schulte, E. Swartzlander
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378099
Proceedings of IEEE 11th Symposium on Computer Arithmetic
An algorithm is described which produces exactly rounded results for the reciprocal, square root, 2^x, and log2(x) functions. Hardware designs based on this algorithm are presented for floating-point numbers with 16- and 24-b significands. These designs use a polynomial approximation whose coefficients are initially selected from a Chebyshev series approximation and are then adjusted to ensure exactly rounded results for all inputs. To reduce the number of terms in the approximation, the input interval is divided into subintervals of equal size, with different coefficients for each subinterval. For floating-point numbers with 16-b significands, the exactly rounded value of a function can be computed in 51 ns on a 20-mm^2 chip; for 24-b significands, in 80 ns on a 98-mm^2 chip.
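The coefficient-adjustment step is the paper's contribution; the underlying piecewise approximation it starts from can be sketched as follows. This is a minimal illustration using degree-2 interpolation of 1/x at Chebyshev nodes on 64 equal subintervals (the segment count, degree, and node choice are illustrative, not the paper's adjusted minimax coefficients):

```python
import math

def chebyshev_nodes(a, b, n):
    """n Chebyshev nodes on [a, b]."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos((2 * i + 1) * math.pi / (2 * n))
            for i in range(n)]

def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        w = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                w *= (x - xj) / (xi - xj)
        total += yi * w
    return total

def make_tables(f, a, b, segments, degree):
    """Per-subinterval node/value tables (the stored 'coefficients')."""
    h = (b - a) / segments
    tables = []
    for s in range(segments):
        lo = a + s * h
        xs = chebyshev_nodes(lo, lo + h, degree + 1)
        tables.append((xs, [f(x) for x in xs]))
    return tables

def approx(tables, a, b, x):
    """Select the subinterval for x and evaluate its polynomial."""
    segments = len(tables)
    s = min(int((x - a) / ((b - a) / segments)), segments - 1)
    xs, ys = tables[s]
    return lagrange_eval(xs, ys, x)
```

With 64 segments and degree 2, the worst-case error for 1/x on [1, 2) is on the order of 1e-7, which shows why few terms per subinterval suffice once the interval is split.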
An accurate LNS arithmetic unit using interleaved memory function interpolator
D. Lewis
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378115
A logarithmic number system (LNS) arithmetic unit using a new method for polynomial interpolation in hardware is described. The use of an interleaved memory reduces storage requirements by allowing each stored function value to be used in interpolation across several segments. This strategy always uses fewer words of memory than an optimized polynomial with stored polynomial coefficients. Many accuracy requirements for the LNS arithmetic unit are possible, but round-to-nearest cannot easily be achieved. The goal suggested here is to ensure that the worst-case LNS relative error is smaller than the worst-case FP relative error. Using the interleaved memory interpolator, the detailed design of an LNS arithmetic unit is carried out with a second-order polynomial interpolator containing approximately 91K bits of ROM.
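As a rough model of the approach (not Lewis's interleaved addressing scheme), the sketch below stores samples of the LNS addition function sb(d) = log2(1 + 2^d) at fixed spacing and forms a second-order interpolation in which each stored value is shared by the stencils of neighbouring segments. The spacing and cutoff are illustrative choices:

```python
import math

H = 1.0 / 64.0   # sample spacing (illustrative)
DMIN = -32.0     # below this, sb(d) is negligibly small

def sb(d):
    """sb(d) = log2(1 + 2^d), the LNS addition function, d <= 0."""
    return math.log2(1.0 + 2.0 ** d)

# Stored function values; each value is reused by the interpolation
# stencils of adjacent segments, in the spirit of the interleaved memory.
TABLE = [sb(DMIN + i * H) for i in range(int(-DMIN / H) + 2)]

def sb_interp(d):
    """Second-order (3-point) interpolation of sb from stored values."""
    if d <= DMIN:
        return 0.0
    t = (d - DMIN) / H
    i = min(int(t), len(TABLE) - 3)
    u = t - i
    y0, y1, y2 = TABLE[i], TABLE[i + 1], TABLE[i + 2]
    # Newton forward quadratic through three consecutive stored samples
    return y0 + u * (y1 - y0) + 0.5 * u * (u - 1.0) * (y2 - 2.0 * y1 + y0)

def lns_add(a, b):
    """Given a = log2(x), b = log2(y), return approx log2(x + y)."""
    hi, lo = max(a, b), min(a, b)
    return hi + sb_interp(lo - hi)
```

The quadratic stencil keeps the interpolation error far below the sample spacing, which is what makes sharing stored values across segments cheaper than storing per-segment polynomial coefficients.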
Fast implementations of RSA cryptography
M. Shand, J. Vuillemin
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378085
The authors detail and analyze the critical techniques that may be combined in the design of fast hardware for RSA cryptography: Chinese remainders, star chains, Hensel's odd division (also known as Montgomery modular reduction), carry-save representation, quotient pipelining, and asynchronous carry-completion adders. A fully operational PAM (programmable active memory) implementation of RSA combining all of these techniques delivers a secret decryption rate of over 600 kb/s for 512-b keys and 165 kb/s for 1-kb keys. This is an order of magnitude faster than any previously reported running implementation. While the implementation makes full use of the PAM's reconfigurability, it is possible to derive from the multiple PAM designs a single gate-array specification with an estimated size under 100K gates and a speed over 1 Mb/s for 512-b RSA keys. Matching gains in software performance are also analyzed.
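Of the listed techniques, Hensel's odd division (Montgomery reduction) is the most self-contained. A word-level sketch with Python big integers, assuming R = 2^64 and an odd modulus (constants and names are illustrative, not the paper's hardware formulation):

```python
def montgomery_setup(n, width):
    """Precompute constants for R = 2**width; n must be odd."""
    r = 1 << width
    n_prime = (-pow(n, -1, r)) % r   # n' = -n^{-1} mod R
    return r, n_prime

def redc(t, n, r, n_prime, width):
    """Montgomery reduction: return t * R^{-1} mod n, for 0 <= t < n*R."""
    m = ((t & (r - 1)) * n_prime) & (r - 1)  # make t + m*n divisible by R
    u = (t + m * n) >> width                 # exact division by R
    return u - n if u >= n else u            # single conditional subtraction

def mont_mul(a, b, n, r, n_prime, width):
    """Modular multiply via Montgomery form: returns a*b mod n."""
    a_bar = (a * r) % n                      # into Montgomery form
    b_bar = (b * r) % n
    prod = redc(a_bar * b_bar, n, r, n_prime, width)  # = a*b*R mod n
    return redc(prod, n, r, n_prime, width)           # out of Montgomery form
```

The point of the transform is that `redc` replaces a trial division by n with a multiply, an add, and a shift, which is what makes it attractive for carry-save, pipelined hardware.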
BKM: A new hardware algorithm for complex elementary functions
J. Bajard, Sylvanus Kla, J. Muller
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378098
An algorithm for computing complex logarithms and exponentials is proposed. The algorithm is based on shift-and-add elementary steps, and it generalizes the CORDIC algorithm. It can compute the usual real elementary functions. The algorithm is more suitable than CORDIC for computation in a redundant number system, since it requires no scaling factor for the trigonometric functions.
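BKM itself operates on complex values with a redundant digit set; the restoring, real-valued exponential mode below shows only the shift-and-add skeleton it generalizes. Each iteration multiplies by (1 + 2^-n), which in hardware is a shift plus an add (iteration count and domain are illustrative):

```python
import math

# Precomputed shift-and-add constants ln(1 + 2^-n)
LN_TERMS = [math.log(1.0 + 2.0 ** -n) for n in range(60)]

def shift_add_exp(x):
    """Compute e^x for 0 <= x < sum(LN_TERMS) (about 1.56) by a greedy
    shift-and-add recurrence: accumulate ln(1 + 2^-n) terms into t while
    multiplying y by the matching (1 + 2^-n) factors."""
    t, y = 0.0, 1.0
    for n, ln_term in enumerate(LN_TERMS):
        if t + ln_term <= x:      # take this term if it still fits
            t += ln_term
            y *= 1.0 + 2.0 ** -n  # a shift and an add in hardware
    return y
```

Because ln(1 + 2^-n) is always smaller than the sum of the remaining terms, the greedy selection converges, and y tracks e^t with t converging to x.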
A modular multiplication algorithm with triangle additions
N. Takagi
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378083
An algorithm for multiple-precision modular multiplication is proposed. In the algorithm, the upper-half triangle of the partial products is first added up and the residue of the sum is calculated; the sum of the lower-half triangle of the partial products is then added to this residue, and the residue of the total is calculated. An efficient residue-calculation procedure that accelerates the algorithm is also proposed. Since the algorithm is both fast and uses little main memory, it is efficient to implement on small computers, such as card computers, and is useful for bringing public-key cryptosystems to such machines.
Measuring the accuracy of ROM reciprocal tables
Debjit Das Sarma, D. Matula
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378104
It is proved that a conventional ROM reciprocal table construction algorithm generates tables that minimize the relative error. The worst-case relative errors realized by such optimally computed k-bits-in, m-bits-out ROM reciprocal tables are then determined for all table sizes 3 ≤ k, m ≤ 12. It is further proved that the table construction algorithm always generates a k-bits-in, k-bits-out table whose relative error never exceeds (3/4)2^-k for any k and, more generally with g guard bits, that for (k+g)-bits-out the relative error never exceeds 2^-(k+1)(1 + 1/2^(g+1)). To allow test data to be determined without prior construction of a full ROM reciprocal table, a procedure is described that generates and searches only a small portion of such a table to find the input regions yielding the worst-case relative errors.
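A conventional construction of this kind rounds the reciprocal of each input interval's midpoint onto the output grid. The sketch below uses exact rationals and its own bit conventions (k fraction bits in, k+1 fraction bits out), which are illustrative rather than the paper's, so it checks a loose 2^-k bound rather than the paper's tight (3/4)2^-k result:

```python
from fractions import Fraction

def reciprocal_table(k):
    """Entry i holds the reciprocal of the midpoint of the input interval
    [1 + i/2^k, 1 + (i+1)/2^k), rounded to the nearest multiple of
    2^-(k+1) (round-to-nearest on the output grid)."""
    grid = Fraction(1, 2 ** (k + 1))
    table = []
    for i in range(2 ** k):
        mid = 1 + Fraction(2 * i + 1, 2 ** (k + 1))
        table.append(round((1 / mid) / grid) * grid)
    return table

def worst_relative_error(k):
    """Exact worst-case relative error.  |r*x - 1| is linear in x, so the
    maximum over each interval occurs at an interval endpoint."""
    table = reciprocal_table(k)
    worst = Fraction(0)
    for i, r in enumerate(table):
        for x in (1 + Fraction(i, 2 ** k), 1 + Fraction(i + 1, 2 ** k)):
            worst = max(worst, abs(r * x - 1))
    return worst
```

The linearity observation is also the germ of the paper's search procedure: worst cases sit at predictable interval endpoints, so only a small region of the table need ever be generated.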
Multi-parallel convolvers
L. Dadda, V. Piuri, R. Stefanelli
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378107
A scheme for a convolver design, called a multiparallel convolver, is presented; it is based on concurrent processing of p adjacent samples that are input simultaneously to the p-parallel convolver. The scheme uses p units, called p-phase subconvolvers, each of which receives the input samples and produces one convolution every p samples. The detailed design of the p-phase subconvolvers and of the whole p-parallel convolver is presented and discussed. The scheme can be used with either bit-parallel or bit-serial presentation of each input sample. The input sample rate of the p-parallel convolver is p times that of a standard (1-parallel) convolver implemented in the same integration technology, and the number of components required is approximately p times that of a standard convolver.
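The phase decomposition can be shown functionally: p subconvolver loops, each producing every p-th output from the shared input stream, so together they emit p results per p input samples. This is a behavioural sketch of the partitioning only, not the bit-level hardware design:

```python
def direct_convolution(h, x):
    """Reference FIR convolution: y[n] = sum_k h[k] * x[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if 0 <= n - k < len(x):
                acc += hk * x[n - k]
        y.append(acc)
    return y

def p_parallel_convolution(h, x, p):
    """p 'p-phase subconvolvers': unit `phase` computes the outputs with
    index congruent to `phase` mod p, all reading the same input stream."""
    y = [0] * len(x)
    for phase in range(p):
        for n in range(phase, len(x), p):
            acc = 0
            for k, hk in enumerate(h):
                if 0 <= n - k < len(x):
                    acc += hk * x[n - k]
            y[n] = acc
    return y
```

Since each unit runs at the base rate but only owns one output phase, aggregate throughput scales by p, matching the roughly p-fold component count.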
Efficient multiprecision floating point multiplication with optimal directional rounding
W. Krandick, Jeremy R. Johnson
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378088
An algorithm is described for multiplying multiprecision floating-point numbers. The algorithm can produce either the smallest floating-point number greater than or equal to the true product or the greatest floating-point number smaller than or equal to the true product. Software implementations of multiprecision floating-point multiplication can halve the computation time by not computing the low-order digits of the product of the two mantissas; however, such algorithms do not necessarily provide optimally rounded results. The algorithm described here is guaranteed to produce optimally rounded results and typically obtains the same savings.
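The directed-rounding targets themselves are easy to state on integer mantissas: round down is a truncation, round up adds one whenever any discarded bit is set. The sketch below shows those targets with a full product and a sticky bit; the paper's contribution, not reproduced here, is reaching the same results while skipping most of the low-order digit computation:

```python
def round_product_directed(ma, mb, p, up):
    """Round the product of two p-bit integer mantissas to p fixed-point
    digits.  up=True yields the smallest representable value >= the true
    product; up=False the greatest representable value <= it."""
    prod = ma * mb                        # exact (up to 2p-bit) product
    high = prod >> p                      # keep the top digits
    sticky = prod & ((1 << p) - 1)        # is any discarded bit nonzero?
    if up and sticky:
        high += 1                         # directed round toward +infinity
    return high
```

Directed rounding only needs to know whether the discarded part is nonzero, which is exactly why an algorithm can often avoid computing its digits exactly.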
Complex SLI arithmetic: Representation, algorithms and analysis
P. Turner
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378113
The extension of the SLI (symmetric level index) system to complex numbers and arithmetic is discussed. The natural representation of complex quantities in SLI is the modulus-argument form, which can be sensibly packed into a single 64-b word as the equivalent of the 32-b real SLI representation. The arithmetic algorithms prove to be only very slightly more complicated than those for real SLI arithmetic. The representation, the arithmetic algorithms, and the control of errors within these algorithms are described.
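In the complex extension the modulus is the part carried in level-index form. The real scaffolding is simple to show: the level counts how many logarithms bring the value below 1, and the index is what remains. A minimal encode/decode sketch for the magnitude (not the packed 64-b format or the symmetric extension to values below 1):

```python
import math

def to_sli(x):
    """Level-index form of x >= 1: apply ln until the value drops below 1;
    the count is the level, the remainder is the index f in [0, 1)."""
    assert x >= 1.0
    level = 0
    while x >= 1.0:
        x = math.log(x)
        level += 1
    return level, x

def from_sli(level, f):
    """Invert the representation: apply exp 'level' times to the index."""
    for _ in range(level):
        f = math.exp(f)
    return f
```

The iterated exponential is what gives SLI its enormous, overflow-free range, and it is the modulus of a complex value, never the argument, that needs that range.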
The Gauss machine: A Galois-enhanced quadratic residue number system systolic array
J. Mellott, Jermy C. Smith, F. Taylor
Pub Date: 1993-06-29. DOI: 10.1109/ARITH.1993.378097
The Gauss machine is a SIMD systolic-array architecture that takes advantage of the Galois-enhanced quadratic residue number system (GEQRNS) to form reduced-complexity arithmetic elements. The machine is targeted at front-end signal and image processing applications. A discrete prototype has been constructed that achieves a peak rating of 320 million complex arithmetic operations per second while operating at 10 MHz. A VLSI implementation of the Gauss machine's processor cell has also been created; it is implemented in 2.0-µm CMOS and achieves greater than 20-MHz performance in less than 2.0 mm^2 of die area. It is shown that techniques for defect tolerance in RNS systolic arrays can result in substantial yield enhancement, thereby making larger-than-conventional (ULSI) systems possible.
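The "Galois-enhanced" complexity reduction rests on a standard field fact: nonzero residues mod a prime p form a cyclic group, so multiplication becomes addition of generator indices (discrete logs), turning each modular multiplier into an adder plus small lookup tables. A small-prime sketch of that index arithmetic (the prime and table layout are illustrative):

```python
def find_generator(p):
    """Smallest primitive root of the prime field GF(p) (brute force)."""
    for g in range(2, p):
        seen, v = set(), 1
        for _ in range(p - 1):
            v = v * g % p
            seen.add(v)
        if len(seen) == p - 1:      # g's powers cover all nonzero residues
            return g

def build_index_tables(p):
    """Log and antilog ROMs: residue <-> exponent of the generator."""
    g = find_generator(p)
    log_t, alog_t, v = {}, {}, 1
    for e in range(p - 1):
        log_t[v] = e
        alog_t[e] = v
        v = v * g % p
    return log_t, alog_t

def index_mul(a, b, p, log_t, alog_t):
    """Multiply nonzero residues by adding indices mod p-1; zero is the
    usual special case handled outside the index domain."""
    if a == 0 or b == 0:
        return 0
    return alog_t[(log_t[a] + log_t[b]) % (p - 1)]
```

In a QRNS setting the same trick applies channel-wise, which is how the processor cells shed full modular multipliers.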