Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145553
Thomas W. Lynch, E. Swartzlander
The design of the 56-b significand adder for the Advanced Micro Devices Am29050 microprocessor is described. This is a 1-μm design-rule CMOS realization of a high-performance RISC (reduced instruction set computer) microprocessor that implements IEEE Standard 754 floating-point arithmetic. To achieve an add time of under 4 ns for the 56-b significand and to avoid multistage pipelines, which significantly impair compiler efficiency, a redundant cell adder has been developed. This redundant cell adder design combines carry lookahead adders realized with Manchester carry chains and the carry select adder concept to achieve approximately twice the speed of the traditional carry lookahead adder. The adder achieves a measured add time of 3.2 ns for 56-b operands and is of reasonable size.
{"title":"The redundant cell adder","authors":"Thomas W. Lynch, E. Swartzlander","doi":"10.1109/ARITH.1991.145553","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145553","url":null,"abstract":"The design of the 56-b significand adder for the Advanced Micro Devices, Am29050 microprocessor, is described. This is a 1- mu m design rule CMOS realization of a high-performance RISC (reduced instruction set computer) microprocessor that implements IEEE Standard 754 floating-point arithmetic. To achieve an add time of under 4 ns for the 56-b significand and to avoid multistage pipelines which significantly impair compiler efficiency, a redundant cell adder has been developed. This redundant cell adder design combines carry lookahead adders realized with Manchester carry chains and the carry select adder concept to achieve approximately twice the speed of the traditional carry lookahead adder. This adder achieves a 3.2-ns measured add time for 56-bit operands and is of reasonable size.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117065698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
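The carry select concept named in the redundant cell adder abstract above can be illustrated in software. The following is a minimal behavioral sketch, not the Am29050 circuit: the block size of 8 and the names `ripple_add` and `carry_select_add` are assumptions made for illustration, and a simple ripple loop stands in for the Manchester carry chain lookahead blocks of the actual design.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """Add two equal-length little-endian bit lists; return (sum_bits, carry_out)."""
    out = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_select_add(x, y, width=56, block=8):
    """Carry-select addition of two unsigned `width`-bit integers.

    Returns (sum modulo 2**width, carry_out). The block size is an
    illustrative assumption, not a figure from the paper.
    """
    xb = [(x >> i) & 1 for i in range(width)]
    yb = [(y >> i) & 1 for i in range(width)]
    result = []
    carry = 0
    for lo in range(0, width, block):
        a, b = xb[lo:lo + block], yb[lo:lo + block]
        # In hardware both candidate sums are computed concurrently,
        # one assuming carry-in 0 and one assuming carry-in 1.
        sum0, c0 = ripple_add(a, b, 0)
        sum1, c1 = ripple_add(a, b, 1)
        # The real incoming carry only steers a multiplexer.
        result.extend(sum1 if carry else sum0)
        carry = c1 if carry else c0
    value = sum(bit << i for i, bit in enumerate(result))
    return value, carry
```

The point of the structure is that the slow part (adding within a block) never waits for the carry from lower blocks; only the final selection does.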
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145539
N. Wigley, G. Jullien, Daniel Reaume, W. Miller
The authors describe mapping, scaling, and conversion processes using a new mapping strategy for the modulus replication residue number system (MRRNS). The strategy allows direct mapping of the bits of either a purely real or a multiplexed bit-coded complex number to a set of independent rings defined by the moduli 3, 5, and 7. The MRRNS technique is superior to a large QRNS system operating with a computational dynamic range of over 27 b. A classical radix-4 implementation of a 1024-point FFT is used for the comparison. The scaling and conversion procedure is shown to be a set of finite ring calculations followed by an array of ordinary binary adders. The most complex finite ring circuit required (a mod-7 multiplier) is shown to be easily implemented in VLSI using the switching tree approach, and mask-extracted simulations at 50 MHz demonstrate the embedding of the switching trees in a dynamic pipeline/evaluate circuit with restoring latch.
{"title":"Small moduli replications in the MRRNS","authors":"N. Wigley, G. Jullien, Daniel Reaume, W. Miller","doi":"10.1109/ARITH.1991.145539","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145539","url":null,"abstract":"The authors describe mapping, scaling, and conversion processes using a new mapping strategy for the modulus replication residue number system (MRRNS). The strategy allows direct mapping of bits of either a purely real or multiplexed bit coded complex number to a set of independent rings, defined by moduli 3, 5, and 7. The MRRNS technique is superior to a large QRNS system operating with a computational dynamic range of over 27 b. A classical radix-4 implementation of a 1024 FFT is used for the comparison. The scaling and conversion procedure is shown to be a set of finite ring calculations followed by an array of ordinary binary adders. The VLSI implementation of the most complex finite ring circuit required (a Mod 7 multiplier) is shown to be easily implemented using the switching tree approach, and mask extracted simulations at 50 MHz demonstrate the embedding of the switching trees in a dynamic pipeline/evaluate circuit with restoring latch.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127875939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
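As background for the MRRNS abstract above, the sketch below shows ordinary residue arithmetic over the rings defined by the moduli 3, 5, and 7. It is only a plain residue number system example: the polynomial replication and bit mapping that distinguish the MRRNS are not modeled, and the function names are assumptions chosen for illustration.

```python
from math import prod

MODULI = (3, 5, 7)  # pairwise coprime; values are unique modulo 3*5*7 = 105


def to_residues(x):
    """Map an integer to its residue in each ring."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Add digit-wise; no carries pass between rings."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Multiply digit-wise; each ring works independently."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, MODULI))


def from_residues(r):
    """Reconstruct the integer modulo 105 with the Chinese remainder theorem."""
    big_m = prod(MODULI)
    total = 0
    for ri, m in zip(r, MODULI):
        mi = big_m // m
        total += ri * mi * pow(mi, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m
```

For instance, `from_residues(rns_mul(to_residues(9), to_residues(11)))` recovers 99, because the product stays below the range of 105.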
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145570
Jeong-A Lee, T. Lang
A constant-factor-redundant-CORDIC (CFR-CORDIC) scheme is developed in which the scale factor is forced to be constant while computing angles for SVD (singular value decomposition). Based on the scheme, a fixed-point implementation of SVD is presented with the following additional features: (1) the final scaling operation is done by shifting; (2) the number of iterations in the CORDIC rotation unit is reduced by about 25% by expressing the direction of the rotation in radix-2 and radix-4; and (3) the conventional number representation of the rotated output is obtained on-the-fly, not from a carry-propagate adder. The authors compare this scheme with previously proposed ones and show that it provides an execution time similar to that of redundant CORDIC with variable scaling factor, with a significant saving in area.
{"title":"SVD by constant-factor-redundant-CORDIC","authors":"Jeong-A Lee, T. Lang","doi":"10.1109/ARITH.1991.145570","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145570","url":null,"abstract":"A constant-factor-redundant-CORDIC (CFR-CORDIC) scheme is developed where the scale factor is forced to be constant while computing angles for SVD (singular value decomposition). Based on the scheme, a fixed-point implementation of SVD is presented with the following additional features: (1) the final scaling operation is done by shifting; (2) the number of iterations in the CORDIC rotation unit is reduced by about 25% by expressing the direction of the rotation in radix-2 and radix-4; and (3) the conventional number representation of rotated output is obtained on-the-fly, not from a carry-propagate adder. The authors compare this scheme with previously proposed ones and show that it provides an execution time similar to that of redundant CORDIC with variable scaling factor, with significant saving in area.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127398153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
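To make the scale factor issue in the CFR-CORDIC abstract above concrete, here is a minimal sketch of conventional (non-redundant) CORDIC rotation in floating point. It is not the CFR-CORDIC scheme: the function name and iteration count are assumptions for illustration. Because every iteration rotates by +atan(2^-i) or -atan(2^-i), the accumulated gain is the same for all angles; in redundant variants that allow a zero direction digit this gain generally depends on the digits chosen, which is the difficulty the abstract addresses.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians using shift-and-add style steps.

    Converges for |angle| up to about 1.74 rad (the sum of all atan(2**-i)).
    """
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # direction digit, never zero here
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    # Each step scales the vector by sqrt(1 + 2**(-2i)), independent of d,
    # so the total gain is a constant that can be divided out once.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iterations))
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, theta)` yields approximations of `cos(theta)` and `sin(theta)`.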
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145547
P. Turner
Extended arithmetic operations, such as forming scalar products, in symmetric level index (SLI) arithmetic are considered. Schemes for the implementation of such algorithms are described and analyzed in terms of comparative timings for these operations and their floating-point counterparts and in terms of the control of errors in the computation. With sufficient parallelism available in the SLI processor, the computation can be as fast as for floating-point operations. The SLI operation can be modified very economically to produce just a single rounding error from extended operations. The implementation details suggest that any time penalty associated with the use of SLI arithmetic can be kept to a very small factor on highly parallel computers, perhaps on the order of just two or three for typical scientific computing programs.
{"title":"Implementation and analysis of extended SLI operations","authors":"P. Turner","doi":"10.1109/ARITH.1991.145547","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145547","url":null,"abstract":"Extended arithmetic operations, such as forming scalar products, in symmetric level index (SLI) arithmetic are considered. Schemes for the implementation of such algorithms are described and analyzed in terms of comparative timings for these operations and their floating-point counterparts and in terms of the control of errors in the computation. With sufficient parallelism available in the SLI processor, the computation can be as fast as for floating-point operations. The SLI operation can be modified to produce just a single rounding error from extended operations very economically. The implementation details suggest that any time-penalty associated with the use of SLI arithmetic can be kept to a very small factor on highly parallel computers, perhaps on the order of just two or three for typical scientific computing programs.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115439551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145569
Shen-Fu Hsiao, J. Delosme
A novel n-dimensional (n-D) CORDIC algorithm for Euclidean and pseudo-Euclidean rotations is proposed. This algorithm is closely related to Householder transformations. It is shown to converge faster than CORDIC algorithms developed earlier for n=3 and 4. Processor architectures for the algorithm are presented. The area and time performance of n-D CORDIC processors are evaluated. For a comparable time performance, the processors require significantly less area than parallel Householder processors. Furthermore, arrays of n-D Euclidean CORDIC processors are shown to speed up the QR decomposition of rectangular matrices by a factor of n-1 in comparison with a 2-D CORDIC processor array.
{"title":"The CORDIC Householder algorithm","authors":"Shen-Fu Hsiao, J. Delosme","doi":"10.1109/ARITH.1991.145569","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145569","url":null,"abstract":"A novel n-dimensional (n-D) CORDIC algorithm for Euclidean and pseudo-Euclidean rotations is proposed. This algorithm is closely related to Householder transformations. It is shown to converge faster than CORDIC algorithms developed earlier for n=3 and 4. Processor architectures for the algorithm are presented. The area and time performance of n-D CORDIC processors are evaluated. For a comparable time performance, the processors require significantly less area than parallel Householder processors. Furthermore, arrays of n-D Euclidean CORDIC processors are shown to speed up the QR decomposition of rectangular matrices by a factor of n-1 in comparison with a 2-D CORDIC processor array.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126012898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145555
L. Kuhnel
The author introduces a purely systolic hardware algorithm for addition which is based on a mesh-connected arrangement of cells. The proposed FASTA algorithm is well suited for realization in integrated technologies. Its area, computation time, and period satisfy A(n) = O(n), T(n) = O(√n), and P(n) = O(√n), respectively, where n denotes the operand length. Therefore, this adder is T-, APT-, and AT²-optimal in the linear model for signal propagation delays. In the class of Θ(√n) time adders it is optimal with respect to A, P, T, AT, APT, AP², and AT². The suggested algorithm essentially is a solution to the general problem of parallel prefix computation. Therefore, it can serve as a paradigm for the design of optimal purely systolic hardware algorithms in a wide range of application domains.
{"title":"Optimal purely systolic addition","authors":"L. Kuhnel","doi":"10.1109/ARITH.1991.145555","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145555","url":null,"abstract":"The author introduces a purely systolic hardware algorithm for addition which is based on a mesh-connected arrangement of cells. The proposed FASTA algorithm is well suited for realization in integrated technologies. Its area, computation time, and period satisfy A(n)=O(n), T(n)=O( square root n), and P(n)=O( square root n), respectively, where n denotes the operand length. Therefore, this adder is T-, APT-, and AT/sup 2/-optimal in the linear model for signal propagation delays. In the class of Theta ( square root n) time adders it is optimal with respect to A, P, T, AT, APT, AP/sup 2/, and AT/sup 2/. The suggested algorithm essentially is a solution to the general problem of parallel prefix computation. Therefore, it can serve as a paradigm for the design of optimal purely systolic hardware algorithms in a wide range of application domains.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"427 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115654719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
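The systolic addition abstract above relates addition to parallel prefix computation. The sketch below shows that connection with a generic logarithmic-depth prefix scheme over generate/propagate pairs (in the style usually called Kogge-Stone). It is not the FASTA mesh algorithm and does not model its O(√n) timing; the function name and the default width of 16 are assumptions for illustration.

```python
def prefix_add(x, y, width=16):
    """Add two unsigned `width`-bit integers via a parallel-prefix carry network.

    Returns (sum modulo 2**width, carry_out).
    """
    g = [((x >> i) & (y >> i)) & 1 for i in range(width)]   # generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]   # propagate
    big_g, big_p = g[:], p[:]
    d = 1
    while d < width:
        # All positions in one level are independent, so hardware can
        # evaluate a whole level in a single step.
        new_g, new_p = big_g[:], big_p[:]
        for i in range(d, width):
            new_g[i] = big_g[i] | (big_p[i] & big_g[i - d])
            new_p[i] = big_p[i] & big_p[i - d]
        big_g, big_p = new_g, new_p
        d *= 2
    # After the last level, big_g[i] is the carry out of bit i.
    carries = [0] + big_g[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, big_g[-1]
```

The combining step on (generate, propagate) pairs is associative, which is what allows the carries to be computed as a prefix problem rather than by rippling.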
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145556
J. Vuillemin
The author introduces a synchronous binary counter which can be operated at a high clock frequency, independent of the counter's length n: all signals traverse at most two three-input logic gates during each clock phase. The proposed design is simple enough to have practical implications, as illustrated by a CMOS programmable gate array implementation which has counted up to 2^40 with a 40-MHz clock. The area required for laying out this design is no larger than that of the (much slower) carry-ripple counter.
{"title":"Constant time arbitrary length synchronous binary counters","authors":"J. Vuillemin","doi":"10.1109/ARITH.1991.145556","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145556","url":null,"abstract":"The author introduces a synchronous binary counter which can be operated under a high clock frequency, independent of the counter's length n: all signals traverse at most two three-input logic gates during each clock phase. The proposed design is simple enough to have practical implications, as illustrated by a CMOS programmable gate array implementation which has counted up to 2/sup 40/ with a 40-MHz clock. The area required for laying out this design is no larger than that of the (much slower) carry-ripple counter.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"151 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133657196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145565
P. T. P. Tang
Table-lookup algorithms for calculating elementary functions offer superior speed and accuracy when compared with more traditional algorithms. It is shown that, with careful design, it is feasible to implement table-lookup algorithms in hardware. A uniform approach for carrying out a tight error analysis for such implementations is presented. The advantages of table-lookup algorithms over CORDIC and ordinary (without table-lookup) polynomial algorithms are described.
{"title":"Table-lookup algorithms for elementary functions and their error analysis","authors":"P. T. P. Tang","doi":"10.1109/ARITH.1991.145565","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145565","url":null,"abstract":"Table-lookup algorithms for calculating elementary functions offer superior speed and accuracy when compared with more traditional algorithms. It is shown that, with careful design, it is feasible to implement table-lookup algorithms in hardware. A uniform approach for carrying out a tight error analysis for such implementations is presented. The advantages of table-lookup algorithms over CORDIC and ordinary (without table-lookup) polynomial algorithms are described.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133205887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
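The table-lookup approach described in the abstract above can be illustrated for the exponential function. This is a rough sketch under stated assumptions, not the algorithm or error analysis of the paper: the 32-entry table, the degree-4 Taylor polynomial, and the name `exp_table` are illustrative choices, and a careful implementation would use a tuned polynomial and extra-precision table entries.

```python
import math

TABLE_BITS = 5
N = 1 << TABLE_BITS                              # 32 table entries (assumed size)
TABLE = [2.0 ** (j / N) for j in range(N)]       # 2**(j/32), j = 0..31


def exp_table(x):
    """Approximate exp(x) as 2**m * 2**(j/N) * exp(r) with a tiny remainder r."""
    k = round(x * N / math.log(2))               # nearest multiple of ln2/N
    r = x - k * math.log(2) / N                  # |r| <= ln2/(2N), about 0.011
    m, j = divmod(k, N)                          # floor division keeps j in [0, N)
    # Short polynomial for exp(r); it suffices because r is so small.
    poly = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r / 24.0)))
    return math.ldexp(TABLE[j] * poly, m)        # scale by 2**m exactly
```

The table absorbs most of the argument, so the polynomial only has to cover a very narrow interval; that is the source of the speed and accuracy trade-off the abstract refers to.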
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145532
Mayur Mehta, Vijay Parmar, E. Swartzlander
The design of a fast multiplier implemented using either
{"title":"High-speed multiplier design using multi-input counter and compressor circuits","authors":"Mayur Mehta, Vijay Parmar, E. Swartzlander","doi":"10.1109/ARITH.1991.145532","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145532","url":null,"abstract":"The design of a fast multiplier implemented using either","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131200934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1991-06-26 DOI: 10.1109/ARITH.1991.145533
Holger Orup, Peter Kornerup
In a class of cryptosystems, fast computation of modular exponentials is essential. The authors present a parallel version of a well-known exponentiation algorithm that halves the worst-case computing time. They describe how a high-radix modular multiplication can be implemented by interleaving a serial-parallel multiplication scheme with an SRT division scheme. The problems associated with high radices are efficiently solved by the use of a redundant representation of intermediate operands. It is shown how the algorithms can be realized as a highly regular VLSI circuit. Simulations indicate that a radix-32 implementation of the algorithms is capable of computing 512-b operand exponentials in 3.2 ms. This is more than five times faster than other known implementations.
{"title":"A high-radix hardware algorithm for calculating the exponential M/sup E/ modulo N","authors":"Holger Orup, Peter Kornerup","doi":"10.1109/ARITH.1991.145533","DOIUrl":"https://doi.org/10.1109/ARITH.1991.145533","url":null,"abstract":"In a class of cryptosystems, fast computation of modulo exponentials is essential. The authors present a parallel version of a well-known exponentiation algorithm that halves the worst-case computing time. It is described how a high radix modulo multiplication can be implemented by interleaving a serial-parallel multiplication scheme with an SRT division scheme. The problems associated with high radices are efficiently solved by the use of a redundant representation of intermediate operands. It is shown how the algorithms can be realized as a highly regular VLSI circuit. Simulations indicate that a radix 32 implementation of the algorithms is capable of computing 512-b operand exponentials in 3.2 ms. This is more than five times faster than other known implementations.<<ETX>>","PeriodicalId":190650,"journal":{"name":"[1991] Proceedings 10th IEEE Symposium on Computer Arithmetic","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116917494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
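The abstract above does not name the exponentiation algorithm it parallelizes. Assuming it is the binary square-and-multiply method, the sketch below shows the right-to-left form, in which the multiplication and the squaring of each iteration do not depend on each other and could therefore run on two multipliers at once. The function name is hypothetical, and the high-radix interleaved multiplication and SRT division of the paper are not modeled; Python's `%` stands in for the modular reduction.

```python
def modexp_right_to_left(base, exponent, modulus):
    """Compute base**exponent mod modulus by scanning exponent bits from the LSB."""
    result = 1 % modulus
    square = base % modulus
    while exponent > 0:
        # These two products use only values from the previous iteration,
        # so hardware could evaluate them in parallel.
        next_result = (result * square) % modulus if exponent & 1 else result
        next_square = (square * square) % modulus
        result, square = next_result, next_square
        exponent >>= 1
    return result
```

Executed sequentially, an n-bit exponent of all ones costs about 2n modular multiplications; with the two products overlapped it costs about n steps, which is consistent with the halving of the worst case that the abstract reports.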