2013 IEEE 21st Symposium on Computer Arithmetic: Latest Articles


FPU Generator for Design Space Exploration

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.27

Sameh Galal, Ofer Shacham, J. Brunhaver, Jing Pu, A. Vassiliev, M. Horowitz

FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSAs are connected in the tree can make a larger difference than the type of tree used.
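
As a rough illustration of what the "Booth-2" encoders mentioned in the abstract do, the Python sketch below recodes an unsigned multiplier into radix-4 digits in {-2, ..., 2}, which roughly halves the number of partial products fed to the summation tree. This is the textbook radix-4 Booth recoding, not the paper's generator; the function name `booth2_digits` and its `width` parameter are assumptions made for this example.

```python
def booth2_digits(multiplier, width):
    """Radix-4 (Booth-2) recoding of an unsigned `width`-bit multiplier.

    Returns digits d_i in {-2, -1, 0, 1, 2}, least significant first,
    such that multiplier == sum(d_i * 4**i).
    """
    if not 0 <= multiplier < (1 << width):
        raise ValueError("multiplier does not fit in `width` bits")
    # Append the implicit 0 below the LSB; bits above the MSB read as 0.
    padded = multiplier << 1
    digits = []
    # One extra digit absorbs a possible carry out of the top bit pair.
    for i in range(width // 2 + 1):
        triple = (padded >> (2 * i)) & 0b111  # b[2i+1], b[2i], b[2i-1]
        b_hi = (triple >> 2) & 1
        b_mid = (triple >> 1) & 1
        b_lo = triple & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits
```

For instance, `booth2_digits(6, 4)` yields `[-2, 2, 0]`, since -2 + 2·4 = 6. Each non-zero digit selects a shifted, possibly negated copy of the multiplicand as one partial product.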


Citations: 37

Another Look at Inversions over Binary Fields

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.25

V. Dimitrov, K. Järvinen

In this paper we offer new algorithms for one of the most common operations in public key cryptosystems: inversion over binary Galois fields. The new algorithms are based on double-base and triple-base representations. They are provably more economical, in terms of the average number of multiplications, than the popular Itoh-Tsujii algorithm. In addition to having fewer multiplications, the new inversion algorithms offer further implementation advantages because they allow more efficient computation of squarings and, in some cases, require fewer temporary variables. The new algorithms are straightforwardly usable in both software and hardware implementations.
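
For context, the baseline the abstract compares against can be sketched in a few lines. The Python code below is a minimal model of the classic Itoh-Tsujii inversion in GF(2^m) with a polynomial basis, not the paper's double-base or triple-base algorithms. It computes a^(-1) = (a^(2^(m-1) - 1))^2 by building exponents of the form 2^k - 1 with an addition chain on m - 1. The helper names and the bitmask representation of field elements are assumptions made for this example.

```python
def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m); `poly` includes the x^m term."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly  # reduce modulo the field polynomial
    return result


def gf2m_sqr_n(a, n, m, poly):
    """Square a repeatedly n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = gf2m_mul(a, a, m, poly)
    return a


def itoh_tsujii_inverse(a, m, poly):
    """Invert a non-zero element of GF(2^m), m >= 2."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta, k = a, 1  # invariant: beta == a^(2^k - 1)
    # Scan the bits of m - 1 below its leading 1 (binary addition chain).
    for bit in bin(m - 1)[3:]:
        # Doubling step: a^(2^(2k) - 1) = (beta^(2^k)) * beta
        beta = gf2m_mul(gf2m_sqr_n(beta, k, m, poly), beta, m, poly)
        k *= 2
        if bit == "1":
            # Increment step: a^(2^(k+1) - 1) = (beta^2) * a
            beta = gf2m_mul(gf2m_sqr_n(beta, 1, m, poly), a, m, poly)
            k += 1
    # Here k == m - 1, and the inverse is a^(2^m - 2) = beta^2.
    return gf2m_mul(beta, beta, m, poly)
```

With m = 8 and the polynomial 0x11B, the chain uses four multiplications plus squarings. The number of multiplications is the quantity the abstract says the new algorithms reduce on average.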


Citations: 20

Fault Detection in RNS Montgomery Modular Multiplication

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.31

J. Bajard, J. Eynard, F. Gandino

Recent studies have demonstrated the importance of protecting the hardware implementations of cryptographic functions against side-channel and fault attacks. In recent years, very efficient implementations of modular arithmetic in RNS (RSA, ECC, pairings) have been developed on FPGAs as well as on GPUs. Thus the protection of RNS Montgomery modular multiplication is a crucial issue. For that purpose, some techniques have been proposed to protect this RNS operation against side-channel analysis. Nevertheless, there are still no effective and generic approaches for the detection of fault injection that are also compatible with leak-resistant arithmetic. This paper proposes a new RNS Montgomery multiplication algorithm with fault detection capability. A mathematical analysis demonstrates the validity of the proposed approach. Moreover, an architecture that implements the proposed algorithm is presented. A comparative analysis shows that the introduction of the proposed fault detection technique requires only a limited increase in area.


Citations: 23

On-the-Fly Multi-base Recoding for ECC Scalar Multiplication without Pre-computations

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/arith.2013.17

Thomas Chabrier, A. Tisserand

Scalar recoding is a popular way to speed up ECC scalar multiplication: non-adjacent form, double-base number system, multi-base number system. But fast recoding methods require pre-computations: multiples of the base point or off-line conversion. In this paper, we present a multi-base recoding method for ECC scalar multiplication based on i) a greedy algorithm starting with the least significant terms, ii) cheap divisibility tests by multi-base elements and iii) fast exact divisions by multi-base elements. Multi-base terms are obtained on-the-fly using a special recoding unit which operates in parallel with curve-level operations and at very high speed. This ensures that all recoding steps are performed fast enough to schedule the next curve-level operations without interruptions. The proposed method can be fully implemented in hardware without pre-computations. We report FPGA implementation details and very good performance compared to state-of-the-art results.
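
The three ingredients listed in the abstract (least-significant-first greedy choice, divisibility tests, exact divisions) can be illustrated with a simplified software model. The Python sketch below is an assumption-laden toy, not the authors' hardware recoding unit: the base set (2, 3, 5), the digit set {+1, -1}, the tie-breaking rule and all function names are choices made only for this example.

```python
from math import prod


def _smooth_part(n, bases):
    """Largest divisor of n > 0 that is a product of powers of the bases."""
    part = 1
    for b in bases:
        while n % b == 0:
            n //= b
            part *= b
    return part


def recode_multibase(k, bases=(2, 3, 5)):
    """Recode k > 0 as a sum of terms digit * prod(base**exp), digit = +1 or -1.

    Terms are produced least significant first.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if 2 not in bases:
        raise ValueError("this sketch needs base 2 to guarantee termination")
    terms = []
    exps = [0] * len(bases)
    while k > 0:
        # Divisibility tests followed by exact divisions.
        for i, b in enumerate(bases):
            while k % b == 0:
                k //= b
                exps[i] += 1
        # Greedy choice: pick the digit that leaves the most base factors.
        if k == 1 or _smooth_part(k - 1, bases) >= _smooth_part(k + 1, bases):
            digit = 1
        else:
            digit = -1
        terms.append((digit, tuple(exps)))
        k -= digit
    return terms


def evaluate_terms(terms, bases=(2, 3, 5)):
    """Rebuild the scalar from its multi-base terms (used to check the recoding)."""
    return sum(d * prod(b ** e for b, e in zip(bases, exps)) for d, exps in terms)
```

For example, 7 is recoded as -1 + 2^3. In a scalar multiplication, each exact division corresponds to a point doubling, tripling or quintupling, and each term to one point addition or subtraction.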


Citations: 21

A Formally-Verified C Compiler Supporting Floating-Point Arithmetic

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.30

S. Boldo, Jacques-Henri Jourdan, X. Leroy, G. Melquiond

Floating-point arithmetic is known to be tricky: roundings, formats, exceptional values. The IEEE-754 standard was a push towards straightening the field and made formal reasoning about floating-point computations easier and flourishing. Unfortunately, this is not sufficient to guarantee the final result of a program, as several other actors are involved: programming language, compiler, architecture. The CompCert formally-verified compiler provides a solution to this problem: this compiler comes with a mathematical specification of the semantics of its source language (a large subset of ISO C90) and target platforms (ARM, PowerPC, x86-SSE2), and with a proof that compilation preserves semantics. In this paper, we report on our recent success in formally specifying and proving correct CompCert's compilation of floating-point arithmetic. Since CompCert is verified using the Coq proof assistant, this effort required a suitable Coq formalization of the IEEE-754 standard; we extended the Flocq library for this purpose. As a result, we obtain the first formally verified compiler that provably preserves the semantics of floating-point programs.


Citations: 31

A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.36

Susanne Engels, E. Kavun, C. Paar, Tolga Yalçin, Hristina Mihajloska

Modern cryptography is substantially involved with securing lightweight (and pervasive) devices. For this purpose, several lightweight cryptographic algorithms have already been proposed. Up to now, the literature has focused on hardware efficiency, while lightweight design with respect to software has barely been addressed. However, a large percentage of lightweight ciphers will be implemented on embedded CPUs without support for cryptographic operations. In parallel, many lightweight ciphers are based on operations which are hardware-friendly but quite costly in software. For instance, bit permutations that accrue essentially no costs in hardware require a non-trivial number of CPU cycles and/or lookup tables in software. Similarly, S-Boxes often require relatively large lookup tables in software. In this work, we try to address the open question of efficient cipher implementations on small CPUs by introducing a non-linear/linear instruction set extension, which we refer to as NLU, capable of implementing non-linear operations expressed in their algebraic normal form (ANF) and linear operations expressed in binary "matrix multiply-and-add" form. The proposed NLU is targeted at embedded microcontrollers and is therefore 8 bits wide. However, its modular architecture allows it to be used in 16-, 32-, 64-, and even 4-bit CPUs. We furthermore present examples of the use of NLU in the implementation of standard cryptographic algorithms in order to demonstrate its coding advantage.
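
To make the two operation classes concrete, the Python sketch below models them in software: a Boolean function evaluated from its ANF, and a linear layer computed as a binary matrix multiply-and-add over GF(2). It only illustrates the mathematics named in the abstract; the bitmask encodings and function names are assumptions for this example and say nothing about the actual NLU instruction formats.

```python
def eval_anf(monomials, x):
    """Evaluate a Boolean function given in algebraic normal form.

    Each monomial is a bitmask of the input bits that are ANDed together
    (mask 0 is the constant term 1); the monomials are XORed.
    """
    result = 0
    for mask in monomials:
        result ^= 1 if (x & mask) == mask else 0
    return result


def gf2_matrix_multiply_add(rows, x, c):
    """Compute y = M*x XOR c over GF(2).

    `rows[i]` is row i of M as a bitmask; bit i of the result is the
    parity of (rows[i] AND x), XORed with bit i of the constant c.
    """
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y ^ c
```

For example, `eval_anf([0b011, 0b100, 0b000], x)` evaluates f(x) = x0·x1 ⊕ x2 ⊕ 1. A bit permutation is the special case of `gf2_matrix_multiply_add` in which every row has exactly one bit set and `c` is zero.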


Citations: 7

The Unary Arithmetical Algorithm in Bimodular Number Systems

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.10

P. Kurka, M. Delacourt

We analyze the performance of the unary arithmetical algorithm which computes a Moebius transformation in bimodular number systems which extend the binary signed system. We give statistical evidence that in some of these systems, the algorithm has linear average time complexity.


Citations: 5

Accurate and Fast Evaluation of Elementary Symmetric Functions

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.18

Hao Jiang, S. Graillat, R. Barrio

This paper is concerned with the fast and accurate evaluation of elementary symmetric functions. We present a new compensated algorithm by applying error-free transformations to improve the accuracy of the so-called Summation Algorithm, which is used, for example, in MATLAB's poly function. We derive a forward roundoff error bound and a running error bound for our new algorithm. The roundoff error bound implies that the computed result is as accurate as if computed with twice the working precision and then rounded to the current working precision. The running error analysis provides a sharper bound along with the result, without significantly increasing the computational cost. Numerical experiments illustrate that our algorithm runs much faster than the algorithm using the classic double-double library while sharing similar error estimates. Such an algorithm is widely applicable, for example to compute characteristic polynomials from eigenvalues. It can also be used in the Rasch model in psychological measurement.
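
The uncompensated starting point can be sketched briefly. The Python code below shows the plain Summation Algorithm recurrence for the elementary symmetric functions, together with Knuth's TwoSum, a standard error-free transformation of the kind the abstract refers to. It is not the paper's compensated algorithm, which also tracks the rounding errors of the products; the function names are assumptions for this example.

```python
def elementary_symmetric(xs):
    """Return [e_0, e_1, ..., e_n] of the inputs xs (Summation Algorithm).

    After processing x_1..x_i, e[j] holds the j-th elementary symmetric
    function of those i values.
    """
    e = [1.0] + [0.0] * len(xs)
    for i, x in enumerate(xs, start=1):
        # Descend so that e[j - 1] is still the value from the previous step.
        for j in range(i, 0, -1):
            e[j] += x * e[j - 1]
    return e


def two_sum(a, b):
    """Knuth's TwoSum: return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err
```

For the inputs 1, 2, 3 the recurrence returns the coefficients 1, 6, 11, 6, which up to alternating signs are those of (t - 1)(t - 2)(t - 3). A compensated variant would accumulate the `err` terms from each update and add them back at the end.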


Citations: 8

A Fast Circuit Topology for Finding the Maximum of N k-bit Numbers

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.35

Bilgiday Yuce, H. F. Ugurdag, Sezer Gören, Günhan Dündar

Finding the value and/or address (position) of the maximum element of a set of binary numbers is a fundamental arithmetic operation. Numerous systems, which are used in different application areas, require fast (low-latency) circuits to carry out this operation. We propose a fast circuit topology called Array-Based maximum finder (AB) to determine both value and address of the maximum element within an n-element set of k-bit binary numbers. AB is based on carrying out all of the required comparisons in parallel and then simultaneously computing the address as well as the value of the maximum element. This approach ends up with only one comparator on the critical path, followed by some selection logic. The time complexity of the proposed architecture is O(log₂ n + log₂ k) whereas the area complexity is O(n²k). We developed RTL code generators for AB as well as its competitors. These generators are scalable to any value of n and k. We applied a standard-cell based iterative synthesis flow that finds the optimum time constraint through binary search. The synthesis results showed that AB is 1.2-2.1 times (1.6 times on average) faster than the state of the art.
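
The idea of doing every comparison at once and then selecting can be modeled in software. The Python sketch below is a behavioral illustration under stated assumptions, not the authors' RTL: the tie-breaking rule (lowest address wins) and the function name are choices made for this example. In hardware, the nested comprehension would correspond to an n-by-n array of comparators operating in parallel, which is where the O(n²k) area comes from.

```python
def array_based_max(values):
    """Return (value, address) of the maximum of a non-empty list of unsigned ints."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    # Comparator array: wins[i][j] is true if element i beats element j.
    # Ties are broken toward the lower address (an assumption of this sketch).
    wins = [
        [values[i] > values[j] or (values[i] == values[j] and i <= j)
         for j in range(n)]
        for i in range(n)
    ]
    # Selection logic: exactly one row wins every comparison (one-hot select).
    one_hot = [all(row) for row in wins]
    address = one_hot.index(True)
    # AND-OR multiplexer driven by the one-hot select lines.
    value = 0
    for selected, v in zip(one_hot, values):
        if selected:
            value |= v
    return value, address
```

For example, `array_based_max([3, 9, 2, 9])` returns `(9, 1)`. Because no comparison waits on another, the only serial stages are one comparator, the AND across a row and the final multiplexer.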


Citations: 15

The Floating-Point Unit of the Jaguar x86 Core

2013 IEEE 21st Symposium on Computer Arithmetic

Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.24

J. Rupley, J. King, Eric Quinnell, F. Galloway, Ken Patton, P. Seidel, James Dinh, Hai Bui, A. Bhowmik

The AMD Jaguar x86 core uses a fully-synthesized, 128-bit native floating-point unit (FPU) built as a co-processor model. The Jaguar FPU supports several x86 ISA extensions, including x87, MMX, SSE1 through SSE4.2, AES, CLMUL, AVX, and F16C instruction sets. The front end of the unit decodes two complex operations per cycle and uses a dedicated renamer (RN), free list (FL), and retire queue (RQ) for in-order dispatch and retire. The FPU issues to the execution units with a dedicated out-of-order, dual-issue scheduler. Execution units source operands from a synthesized physical register file (PRF) and bypass network. The back end of the unit has two execution pipes: the first pipe contains a vector integer ALU, a vector integer MUL unit, and a floating-point adder (FPA); the second pipe contains a vector integer ALU, a store-convert unit, and a floating-point iterative multiplier (FPM). The implementation of the unit focused on low-power design and on vectorized single-precision (SP) performance optimizations. The verification of the unit required complex pseudo-random and formal verification techniques. The Jaguar FPU is built in a 28nm CMOS process.


Citations: 20
