Hardware design and arithmetic algorithms for a variable-precision, interval arithmetic coprocessor
M. Schulte, E. Swartzlander
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465354
This paper presents the hardware design and arithmetic algorithms for a coprocessor that performs variable-precision, interval arithmetic. The coprocessor gives the programmer the ability to specify the precision of the computation, determine the accuracy of the result, and recompute inaccurate results with higher precision. Direct hardware support and efficient algorithms for variable-precision, interval arithmetic greatly improve the speed, accuracy, and reliability of numerical computations. Performance estimates indicate that the coprocessor is 200 to 1,000 times faster than a software package for variable-precision, interval arithmetic. The coprocessor can be implemented on a single chip with a cycle time comparable to that of IEEE double-precision floating-point coprocessors.
An area/performance comparison of subtractive and multiplicative divide/square root implementations
Peter Soderquist, M. Leeser
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465366
The implementations of division and square root in the FPUs of current microprocessors are based on one of two categories of algorithms. Multiplicative techniques, exemplified by the Newton-Raphson method and Goldschmidt's algorithm, share functionality with the floating-point multiplier. Subtractive methods, such as the many variations of radix-4 SRT, generally use dedicated, parallel hardware. These different approaches give rise to the distinct area and performance characteristics explored in this paper. Area comparisons are derived from measurements of commercial and academic hardware implementations. Representative divide/square root implementations are paired with typical add-multiply structures and simulated, using data from current microprocessor and arithmetic coprocessor designs, to obtain performance estimates. The results suggest that subtractive implementations offer a superior balance of area and performance and, because of their parallel operation, stand to benefit most decisively from improvements in technology and growing transistor budgets. Multiplicative methods lend themselves best to situations where hardware reuse is mandated by area or architectural constraints.
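The multiplicative approach above can be illustrated with the Newton-Raphson reciprocal iteration, x' = x(2 − d·x), which doubles the number of correct bits per step and maps division onto the existing multiplier (an illustrative sketch of the numerics, not the hardware datapath; real FPUs seed from a lookup table rather than the linear approximation used here):

```python
def nr_divide(a, d, iterations=5):
    # Newton-Raphson division a/d via the reciprocal of d.
    # Requires d normalized to [0.5, 1); a crude linear seed stands in
    # for the small lookup table a hardware unit would use.
    assert 0.5 <= d < 1.0
    x = 2.9142 - 2.0 * d            # seed: within ~9% of 1/d on [0.5, 1)
    for _ in range(iterations):
        x = x * (2.0 - d * x)       # quadratic convergence: bits double each pass
    return a * x
```

Five iterations take the ~9% seed error down past double precision, which is why a modest table plus two or three iterations suffices in practice.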
Application of fast layout synthesis environment to dividers evaluation
A. Houelle, H. Mehrez, N. Vaucher, L. Montalvo, A. Guyot
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465375
Experience has shown that generator programs are quite often written by VLSI designers, as they hold the empirical knowledge better than anyone. However, their expertise does not necessarily include programming and debugging skills: these designers have to focus on the problem at hand, not on the tools or the language they use to solve it. GenOptim has been created to quickly design efficient IEEE 754 floating-point macro-cell generators that do not rely on particular target technologies. Whereas the design of fast and efficient adders, multipliers, and shifters is well understood, division and square root remain a serious design challenge. GenOptim was used to quickly evaluate new divider architectures.
An ε-arithmetic for removing degeneracies
D. Michelucci
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465353
Symbolic perturbation by infinitely small values removes degeneracies in geometric algorithms and enables programmers to handle only generic cases: there are only a few such cases, whereas degenerate cases are overwhelmingly numerous. Current perturbation schemes have limitations. To overcome them, the paper proposes an ε-arithmetic, i.e. representing infinitely small numbers explicitly and defining the arithmetic operations (+, -, *, /, <, =) on them.
It takes six ones to reach a flaw [Pentium processor]
T. Coe, P. T. P. Tang
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465365
The initial release of the Pentium processor has a flaw in its radix-4 SRT division implementation. It is widely known that five entries were missing from the lookup table, occasionally yielding reduced-precision quotients. In this paper, we use mathematical techniques to analyze the divisors that can possibly cause failures. In particular, we show that bits 5 through 10 (where bit 0 is the MSB) of such divisors must be all ones. This result is useful in compiler-level software patches for systems with unreplaced chips, and we believe that the techniques used here are applicable to analyzing SRT division as well as other hardware algorithms for floating-point arithmetic.
167 MHz radix-4 floating point multiplier
R. Yu, G. Zyner
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465364
An IEEE floating point multiplier with partial support for subnormal operands and results is presented. Radix-4 (modified Booth) encoding and a binary tree of 4:2 compressors are used to generate the 53×53 double-precision product. Delay matching techniques were used in the binary tree stage and in the final addition stage to reduce cycle time. New techniques in rounding and sticky-bit generation were also used to reduce area and timing. The overall multiplier has a latency of 3 cycles, a throughput of 1 cycle, and a cycle time of 6.0 ns. This multiplier has been implemented in a 0.5 µm static CMOS technology in the UltraSPARC RISC microprocessor.
High-speed double precision computation of nonlinear functions
V. Jain, L. Lin
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465370
High-speed coprocessors for computing nonlinear functions are important for advanced scientific computing as well as real-time image processing. In this paper we develop an efficient interpolative approach to such coprocessors. Performed on suitable subintervals of the range of interest, our interpolation, which uses a third-degree polynomial, is adequate for many elementary functions of interest with double-precision mantissas. Our method requires only one major multiplication, two minor multiplications, and a few additions. The minor multiplications are for the second- and third-degree terms, and their significant bits are much fewer than those of the first-degree term. This leads to a very fast and efficient VLSI architecture for such coprocessors. Polynomial-based interpolation can yield considerable benefits over previously used approaches when execution time and silicon area are considered. Further, it is possible to combine the computation of multiple functions on a single chip, with most of the resources shared among several functions.
The SNAP project: towards sub-nanosecond arithmetic
M. Flynn, K. Nowka, G. Bewick, E. Schwarz, Nhon T. Quach
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465374
SNAP, the Stanford subnanosecond arithmetic processor, is an interdisciplinary effort to develop theory, tools, and technology for realizing an arithmetic processor with execution rates under 1 ns. Specific improvements in clocking methods, floating-point addition algorithms, floating-point multiplication algorithms, division and higher-level function algorithms, design tools, and packaging technology were studied. These improvements have been demonstrated in the implementation of several VLSI designs.
Faithful bipartite ROM reciprocal tables
Debjit Das Sarma, D. Matula
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465381
We describe bipartite reciprocal tables that employ separate table lookup of the positive and negative portions of a borrow-save reciprocal value. The fusion of the parts includes a rounding, so the output reciprocals are guaranteed correct to a unit in the last place and typically provide a round-to-nearest reciprocal for over 90% of input arguments. The output rounding can be accomplished in conjunction with multiplier recoding, at practically no cost in logic complexity or time. We demonstrate these tables to be 2 to 4 times smaller than conventional reciprocal tables. For 10- to 16-bit reciprocal table lookup the compression grows from a factor of 4 to over 16, making possible the use of larger seed reciprocals than previously considered cost effective.
High speed DCT/IDCT using a pipelined CORDIC algorithm
Feng Zhou, Peter Kornerup
Pub Date: 1995-07-19 | DOI: 10.1109/ARITH.1995.465361
This paper describes DCT (IDCT) computation using the CORDIC algorithm. By rewriting the DCT, a 1×8 DCT needs only 6 CORDIC computations, whereas a 1×16 DCT requires 22. These can all be pipelined through a single CORDIC unit, so a 16×16 DCT becomes feasible for HDTV compression. Only simple adders, registers, and a more complicated carry look-ahead adder are needed, and the computing speed can be very high. Limited only by the delay of a carry look-ahead adder, the delay time of the pipelined structure is 2-10 ns, and the data rate is 100-500 MHz for an 8×8 DCT/IDCT and 72.2-366.6 MHz for a 16×16 DCT/IDCT when using two units.