FPU Generator for Design Space Exploration (DOI: 10.1109/ARITH.2013.27)
Sameh Galal, Ofer Shacham, J. Brunhaver, Jing Pu, A. Vassiliev, M. Horowitz
FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical yet often neglected in the literature. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency-sensitive designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSAs are connected in the tree can make a larger difference than the type of tree used.
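To make the Booth-2/Booth-3 distinction concrete, here is a minimal Python sketch of standard radix-4 and radix-8 modified-Booth recoding of a multiplier, i.e., the recoding step such a generator parameterizes over. It is textbook material, not the paper's generator; the function name, group count, and loop structure are our own.

    def booth_digits(y, nbits, r):
        """Radix-2**r modified-Booth recoding of a nonnegative nbits-bit integer.
        Booth-2 is r=2 (digits in -2..2), Booth-3 is r=3 (digits in -4..4, which
        needs the 'hard multiple' 3X in a multiplier).  Returns digits d_i with
        y == sum(d_i * 2**(r*i)), cutting n partial products down to about n/r."""
        assert 0 <= y < (1 << nbits)
        bit = lambda m: (y >> m) & 1 if m >= 0 else 0
        groups = -(-nbits // r) + 1       # one extra group absorbs the carry-out
        digits = []
        for i in range(groups):
            d = bit(r * i - 1)                                   # overlap bit
            d += sum(bit(r * i + j) << j for j in range(r - 1))  # positive-weight bits
            d -= bit(r * i + r - 1) << (r - 1)                   # top bit weighs negatively
            digits.append(d)
        return digits

    y = 0b10110101
    for r in (2, 3):
        d = booth_digits(y, 8, r)
        assert sum(di << (r * i) for i, di in enumerate(d)) == y
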
Another Look at Inversions over Binary Fields (DOI: 10.1109/ARITH.2013.25)
V. Dimitrov, K. Järvinen
In this paper we offer new algorithms for one of the most common operations in public-key cryptosystems: inversion over binary Galois fields. The new algorithms are based on using double-base and triple-base representations. They are provably more economical, in terms of the average number of multiplications, than the popular Itoh-Tsujii algorithm. In addition to having fewer multiplications, the new inversion algorithms offer further implementation advantages because they allow more efficient computation of squarings and, in some cases, require fewer temporary variables. The new algorithms are straightforwardly usable in both software and hardware implementations.
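For context, the sketch below implements only the Itoh-Tsujii baseline that the paper measures against, in a toy polynomial-basis GF(2^m); the field size, reduction polynomial, and function names are illustrative choices, and the paper's double-base and triple-base algorithms are not reproduced here.

    def gf2m_mul(a, b, m, poly):
        """Carry-less multiplication modulo a degree-m reduction polynomial."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if (a >> m) & 1:
                a ^= poly
        return r

    def itoh_tsujii_inverse(a, m, poly):
        """a^{-1} = (a^{2^{m-1}-1})^2 in GF(2^m); the exponent chain follows the
        binary expansion of m-1 and costs floor(log2(m-1)) + wt(m-1) - 1
        multiplications plus m-1 cheap squarings."""
        def sq_k(x, k):                       # x^(2^k) by repeated squaring
            for _ in range(k):
                x = gf2m_mul(x, x, m, poly)
            return x
        beta, e = a, 1                        # invariant: beta == a^(2^e - 1)
        for b in bin(m - 1)[3:]:              # bits of m-1 after the leading 1
            beta, e = gf2m_mul(sq_k(beta, e), beta, m, poly), 2 * e
            if b == '1':
                beta, e = gf2m_mul(sq_k(beta, 1), a, m, poly), e + 1
        return gf2m_mul(beta, beta, m, poly)  # final squaring gives a^(2^m - 2)

    m, poly = 8, 0b100011011                  # GF(2^8) with x^8+x^4+x^3+x+1
    x = 0x53
    assert gf2m_mul(x, itoh_tsujii_inverse(x, m, poly), m, poly) == 1
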
Fault Detection in RNS Montgomery Modular Multiplication (DOI: 10.1109/ARITH.2013.31)
J. Bajard, J. Eynard, F. Gandino
Recent studies have demonstrated the importance of protecting the hardware implementations of cryptographic functions against side-channel and fault attacks. In recent years, very efficient RNS implementations of modular arithmetic (for RSA, ECC, and pairings) have been developed on both FPGAs and GPUs. Thus the protection of RNS Montgomery modular multiplication is a crucial issue. For that purpose, some techniques have been proposed to protect this RNS operation against side-channel analysis. Nevertheless, there are still no effective and generic approaches for the detection of fault injection that would additionally be compatible with leak-resistant arithmetic. This paper proposes a new RNS Montgomery multiplication algorithm with fault detection capability. A mathematical analysis demonstrates the validity of the proposed approach. Moreover, an architecture that implements the proposed algorithm is presented. A comparative analysis shows that the introduction of the proposed fault detection technique requires only a limited increase in area.
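The sketch below is only meant to convey the flavor of redundancy-based fault detection in RNS arithmetic: residues are processed independently per channel, and an extra (redundant) modulus exposes a corrupted channel. The moduli, the check, and its placement outside Montgomery multiplication are our own simplifications; the paper's technique is integrated into the RNS Montgomery algorithm itself.

    from math import prod

    BASE = [2**31 - 1, 2**29 - 1, 2**27 - 1, 2**25 - 1]   # pairwise coprime moduli
    RED  = 2**19 - 1                                       # redundant check modulus

    def from_rns(res, mods):
        """CRT reconstruction of the integer represented by the residues."""
        M = prod(mods)
        return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(res, mods)) % M

    def rns_mul_checked(a, b):
        """Channel-wise product plus a redundant-channel consistency check."""
        mods = BASE + [RED]
        rc = [(a % m) * (b % m) % m for m in mods]
        # Re-encoding the reconstructed result into the redundant channel catches
        # any single corrupted residue, unless the injected error happens to be
        # consistent modulo RED.
        c = from_rns(rc[:-1], BASE)
        if c % RED != rc[-1]:
            raise RuntimeError("fault detected in an RNS channel")
        return c

    a, b = 123456789, 987654321
    assert rns_mul_checked(a, b) == a * b     # the product fits in prod(BASE)
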
On-the-Fly Multi-base Recoding for ECC Scalar Multiplication without Pre-computations (DOI: 10.1109/ARITH.2013.17)
Thomas Chabrier, A. Tisserand
Scalar recoding is a popular way to speed up ECC scalar multiplication: non-adjacent form, double-base number systems, multi-base number systems. But fast recoding methods require pre-computations: multiples of the base point or off-line conversion. In this paper, we present a multi-base recoding method for ECC scalar multiplication based on (i) a greedy algorithm that starts with the least significant terms, (ii) cheap divisibility tests by multi-base elements, and (iii) fast exact divisions by multi-base elements. Multi-base terms are obtained on the fly using a special recoding unit which operates in parallel with curve-level operations and at very high speed. This ensures that all recoding steps are performed fast enough to schedule the next curve-level operations without interruption. The proposed method can be fully implemented in hardware without pre-computations. We report FPGA implementation details and very good performance compared to state-of-the-art results.
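As a software illustration of the least-significant-term-first greedy idea, restricted to the double-base case {2, 3} and with plain Python standing in for the paper's hardware divisibility-test and exact-division units, the following recoding uses nothing but divisibility tests and exact divisions:

    def dbns_recode(k):
        """Greedy LSB-first recoding of k > 0 into signed {2,3}-terms:
        k == sum(d * 2**a * 3**b for (d, a, b) in terms), with d in {-1, +1}
        and non-decreasing exponents (a chain, so a scalar multiplication can
        consume the terms with doublings, triplings and one addition each)."""
        terms, a, b = [], 0, 0
        while k != 0:
            while k % 2 == 0:                 # cheap divisibility test ...
                k //= 2; a += 1               # ... followed by an exact division
            while k % 3 == 0:
                k //= 3; b += 1
            d = 1 if k % 6 == 1 else -1       # k is now coprime to 6
            terms.append((d, a, b))
            k -= d                            # makes k divisible by 6 again (or 0)
        return terms

    k = 0xDEADBEEF
    terms = dbns_recode(k)
    assert sum(d * 2**a * 3**b for d, a, b in terms) == k
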
A Formally-Verified C Compiler Supporting Floating-Point Arithmetic (DOI: 10.1109/ARITH.2013.30)
S. Boldo, Jacques-Henri Jourdan, X. Leroy, G. Melquiond
Floating-point arithmetic is known to be tricky: roundings, formats, exceptional values. The IEEE-754 standard was a push towards straightening the field and made formal reasoning about floating-point computations easier and more widespread. Unfortunately, this is not sufficient to guarantee the final result of a program, as several other actors are involved: programming language, compiler, architecture. The CompCert formally verified compiler provides a solution to this problem: this compiler comes with a mathematical specification of the semantics of its source language (a large subset of ISO C90) and target platforms (ARM, PowerPC, x86-SSE2), and with a proof that compilation preserves semantics. In this paper, we report on our recent success in formally specifying and proving correct CompCert's compilation of floating-point arithmetic. Since CompCert is verified using the Coq proof assistant, this effort required a suitable Coq formalization of the IEEE-754 standard; we extended the Flocq library for this purpose. As a result, we obtain the first formally verified compiler that provably preserves the semantics of floating-point programs.
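The example below is not taken from the paper; it is merely a reminder of the kind of semantic detail such a verified compiler must pin down: IEEE-754 binary64 addition is not associative, so a compiler that silently reassociates (or otherwise "optimizes") floating-point expressions changes observable results.

    # binary64 throughout (Python floats); both results are correctly rounded,
    # yet they differ because the rounding error depends on evaluation order.
    a, b, c = 1e16, -1e16, 1.0
    assert (a + b) + c == 1.0     # a + b is exact, then + 1.0
    assert a + (b + c) == 0.0     # b + c rounds back to -1e16, so the 1.0 is lost
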
A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers (DOI: 10.1109/ARITH.2013.36)
Susanne Engels, E. Kavun, C. Paar, Tolga Yalçin, Hristina Mihajloska
Modern cryptography is substantially involved with securing lightweight (and pervasive) devices. For this purpose, several lightweight cryptographic algorithms have already been proposed. Up to now, the literature has focused on hardware efficiency, while lightweight design with respect to software has barely been addressed. However, a large percentage of lightweight ciphers will be implemented on embedded CPUs without support for cryptographic operations. In parallel, many lightweight ciphers are based on operations which are hardware-friendly but quite costly in software. For instance, bit permutations that accrue essentially no cost in hardware require a non-trivial number of CPU cycles and/or lookup tables in software. Similarly, S-boxes often require relatively large lookup tables in software. In this work, we try to address the open question of efficient cipher implementations on small CPUs by introducing a non-linear/linear instruction set extension, which we refer to as NLU, capable of implementing non-linear operations expressed in their algebraic normal form (ANF) and linear operations expressed in binary "matrix multiply-and-add" form. The proposed NLU is targeted at embedded microcontrollers and is therefore 8 bits wide. However, its modular architecture allows it to be used in 16-, 32-, 64-, and even 4-bit CPUs. We furthermore present examples of the use of NLU in the implementation of standard cryptographic algorithms in order to demonstrate its coding advantage.
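To illustrate the two primitive forms mentioned above (the ANF of a non-linear layer and a binary matrix multiply-and-add for a linear layer), here is a small behavioral model in Python; the bit widths, the example ANF, and the matrix are hypothetical and say nothing about the actual NLU encoding.

    def anf_eval(monomials, x):
        """One output bit of a non-linear layer in algebraic normal form: each
        monomial is a bitmask of inputs to AND together (mask 0 is the constant
        1 term), and all monomial values are XORed."""
        out = 0
        for mask in monomials:
            out ^= 1 if (x & mask) == mask else 0
        return out

    def linear_layer(rows, const, x):
        """Binary matrix multiply-and-add: y_i = parity(rows[i] & x) XOR const_i."""
        y = 0
        for i, row in enumerate(rows):
            y |= (bin(row & x).count("1") & 1 ^ (const >> i) & 1) << i
        return y

    f = [0b0011, 0b0100, 0b0000]            # hypothetical ANF: x0*x1 + x2 + 1
    P = [0b0010, 0b0001, 0b1000, 0b0100]    # swap bits 0<->1 and 2<->3, as a GF(2) matrix
    assert anf_eval(f, 0b0110) == 0          # (0 AND 1) XOR 1 XOR 1
    assert linear_layer(P, 0b0000, 0b0110) == 0b1001
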
The Unary Arithmetical Algorithm in Bimodular Number Systems (DOI: 10.1109/ARITH.2013.10)
P. Kurka, M. Delacourt
We analyze the performance of the unary arithmetical algorithm, which computes a Moebius transformation in bimodular number systems that extend the binary signed system. We give statistical evidence that in some of these systems the algorithm has linear average time complexity.
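As background only (the bimodular digit systems and the emission conditions analyzed in the paper are not modeled here): a Moebius transformation M(x) = (ax + b)/(cx + d) is conveniently represented by the integer matrix ((a, b), (c, d)), and composing transformations, which is what a digit-by-digit unary algorithm repeatedly does, is just matrix multiplication.

    from fractions import Fraction

    def apply(M, x):
        (a, b), (c, d) = M
        x = Fraction(x)
        return (a * x + b) / (c * x + d)

    def compose(M, N):                       # matrix of x -> M(N(x))
        (a, b), (c, d) = M
        (e, f), (g, h) = N
        return ((a * e + b * g, a * f + b * h),
                (c * e + d * g, c * f + d * h))

    M = ((1, 1), (0, 2))                     # x -> (x + 1) / 2
    N = ((3, 0), (1, 1))                     # x -> 3x / (x + 1)
    assert apply(compose(M, N), Fraction(1, 3)) == apply(M, apply(N, Fraction(1, 3)))
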
Accurate and Fast Evaluation of Elementary Symmetric Functions (DOI: 10.1109/ARITH.2013.18)
Hao Jiang, S. Graillat, R. Barrio
This paper is concerned with the fast and accurate evaluation of elementary symmetric functions. We present a new compensated algorithm that applies error-free transformations to improve the accuracy of the so-called Summation Algorithm, which is used, for example, in MATLAB's poly function. We derive a forward round-off error bound and a running error bound for our new algorithm. The round-off error bound implies that the computed result is as accurate as if computed with twice the working precision and then rounded to the current working precision. The running error analysis provides a sharper bound along with the result, without significantly increasing the computational cost. Numerical experiments illustrate that our algorithm runs much faster than the algorithm using the classic double-double library while sharing similar error estimates. Such an algorithm is widely applicable, for example to compute characteristic polynomials from eigenvalues. It can also be used in the Rasch model in psychological measurement.
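The building blocks below are the classical pieces the abstract refers to: the two standard error-free transformations (Knuth's TwoSum and Dekker's TwoProduct) and the plain Summation Algorithm recurrence for elementary symmetric functions. The paper's compensated algorithm combines them by also propagating the error terms e; that propagation is not reproduced here.

    def two_sum(a, b):
        """Error-free transformation: a + b == s + e exactly (Knuth)."""
        s = a + b
        t = s - a
        e = (a - (s - t)) + (b - t)
        return s, e

    def two_prod(a, b, _split=2**27 + 1):
        """Error-free transformation: a * b == p + e exactly (Dekker/Veltkamp),
        barring overflow/underflow; with an FMA, e = fma(a, b, -p)."""
        p = a * b
        ah = (_split * a) - ((_split * a) - a)
        al = a - ah
        bh = (_split * b) - ((_split * b) - b)
        bl = b - bh
        e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
        return p, e

    def elem_symmetric(xs):
        """Plain Summation Algorithm: returns [e_1, ..., e_n] of the inputs.
        A compensated version replaces each + and * by two_sum / two_prod and
        feeds the e-terms into a parallel corrective recurrence."""
        n = len(xs)
        e = [1.0] + [0.0] * n
        for x in xs:
            for k in range(n, 0, -1):
                e[k] = e[k] + x * e[k - 1]
        return e[1:]

    assert two_sum(1.0, 1e-16) == (1.0, 1e-16)        # the small addend is not lost
    assert elem_symmetric([1.0, 2.0, 3.0]) == [6.0, 11.0, 6.0]
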
A Fast Circuit Topology for Finding the Maximum of N k-bit Numbers (DOI: 10.1109/ARITH.2013.35)
Bilgiday Yuce, H. F. Ugurdag, Sezer Gören, Günhan Dündar
Finding the value and/or address (position) of the maximum element of a set of binary numbers is a fundamental arithmetic operation. Numerous systems in different application areas require fast (low-latency) circuits to carry out this operation. We propose a fast circuit topology called Array-Based maximum finder (AB) to determine both the value and the address of the maximum element within an n-element set of k-bit binary numbers. AB is based on carrying out all of the required comparisons in parallel and then simultaneously computing the address as well as the value of the maximum element. This approach ends up with only one comparator on the critical path, followed by some selection logic. The time complexity of the proposed architecture is O(log₂ n + log₂ k), whereas the area complexity is O(n²k). We developed RTL code generators for AB as well as its competitors. These generators are scalable to any value of n and k. We applied a standard-cell-based iterative synthesis flow that finds the optimum time constraint through binary search. The synthesis results showed that AB is 1.2-2.1 times (1.6 times on average) faster than the state of the art.
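A behavioral model of the approach described above (all pairwise comparisons conceptually in parallel, then pure selection logic) might look as follows; this is not the AB netlist itself, and breaking ties toward the lowest address is our choice.

    def array_max(values):
        """Return (value, address) of the maximum, leftmost on ties."""
        n = len(values)
        # pairwise comparison results, conceptually computed concurrently
        gt = [[values[i] > values[j] for j in range(n)] for i in range(n)]
        ge = [[values[i] >= values[j] for j in range(n)] for i in range(n)]
        # one-hot winner flags: beat every earlier element strictly,
        # at least tie with every later element
        win = [all(gt[i][j] for j in range(i)) and
               all(ge[i][j] for j in range(i + 1, n)) for i in range(n)]
        address = win.index(True)             # in hardware: an encoder over the flags
        return values[address], address       # in hardware: AND-OR selection logic

    assert array_max([3, 9, 4, 9, 1]) == (9, 1)
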
The Floating-Point Unit of the Jaguar x86 Core (DOI: 10.1109/ARITH.2013.24)
J. Rupley, J. King, Eric Quinnell, F. Galloway, Ken Patton, P. Seidel, James Dinh, Hai Bui, A. Bhowmik
The AMD Jaguar x86 core uses a fully synthesized, 128-bit native floating-point unit (FPU) built as a co-processor model. The Jaguar FPU supports several x86 ISA extensions, including the x87, MMX, SSE1 through SSE4.2, AES, CLMUL, AVX, and F16C instruction sets. The front end of the unit decodes two complex operations per cycle and uses a dedicated renamer (RN), free list (FL), and retire queue (RQ) for in-order dispatch and retire. The FPU issues to the execution units with a dedicated out-of-order, dual-issue scheduler. Execution units source operands from a synthesized physical register file (PRF) and bypass network. The back end of the unit has two execution pipes: the first contains a vector integer ALU, a vector integer MUL unit, and a floating-point adder (FPA); the second contains a vector integer ALU, a store-convert unit, and a floating-point iterative multiplier (FPM). The implementation of the unit focused on low-power design and on vectorized single-precision (SP) performance optimizations. The verification of the unit required complex pseudo-random and formal verification techniques. The Jaguar FPU is built in a 28 nm CMOS process.