2011 IEEE 20th Symposium on Computer Arithmetic: Latest Papers

A Prescale-Lookup-Postscale Additive Procedure for Obtaining a Single Precision Ulp Accurate Reciprocal

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.31

D. Matula, Mihai T. Panu

We investigate the use of additive prescaling to reduce the input range when determining a divisor reciprocal approximation. In particular, we demonstrate that a single-precision (24-bit), ulp-accurate, monotonic approximate reciprocal operation can be obtained using three back-to-back three-term additions and a bipartite table lookup, with a total table size of less than 6 Kbytes.

Citations: 2
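
The bipartite table lookup mentioned in the abstract above can be illustrated with a minimal Python sketch. It shows only a plain bipartite approximation of 1/x on [1, 2) for a 12-bit input, not the paper's prescale and postscale steps; the field widths and the names `build_tables` and `reciprocal` are assumptions chosen for illustration.

```python
# Illustrative bipartite table for 1/x, x = 1 + index * 2**-12 (assumed widths).
N0 = N1 = N2 = 4                 # bit widths of the three input fields
TOTAL = N0 + N1 + N2             # 12-bit fractional input


def build_tables():
    """Build the value table A[x0][x1] and the correction table B[x0][x2]."""
    scale = 2.0 ** -TOTAL
    mid2 = ((1 << N2) - 1) / 2   # centre of the low field x2
    mid1 = 1 << (N1 + N2 - 1)    # centre of the combined (x1, x2) field
    table_a = [
        [1.0 / (1.0 + ((x0 << (N1 + N2)) + (x1 << N2) + mid2) * scale)
         for x1 in range(1 << N1)]
        for x0 in range(1 << N0)
    ]
    table_b = []
    for x0 in range(1 << N0):
        centre = 1.0 + ((x0 << (N1 + N2)) + mid1) * scale
        slope = -1.0 / centre ** 2           # derivative of 1/x at the centre
        table_b.append([slope * (x2 - mid2) * scale for x2 in range(1 << N2)])
    return table_a, table_b


def reciprocal(index, table_a, table_b):
    """Approximate 1 / (1 + index * 2**-12) with two lookups and one addition."""
    x0 = index >> (N1 + N2)
    x1 = (index >> N2) & ((1 << N1) - 1)
    x2 = index & ((1 << N2) - 1)
    return table_a[x0][x1] + table_b[x0][x2]
```

In this sketch the two tables hold 256 entries each, compared with 4096 entries for a direct 12-bit lookup, which is the kind of storage saving a bipartite scheme aims for.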

Latency Sensitive FMA Design

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.26

Sameh Galal, M. Horowitz

The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency-sensitive applications, our cascade design reduces the accumulation-dependent latency by 2x over a fused design, at a cost of a 13% increase in non-accumulation-dependent latency. A simple in-order execution model shows this design is superior in most applications, providing a 12% average reduction in FP stalls and improving performance by up to 6%. Simulations of superscalar out-of-order machines show a 4% average improvement in CPI in 2-way machines and 4.6% in 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiply-add (FMA) unit.

Citations: 12

Self Checking in Current Floating-Point Units

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.18

Daniel Lipetz, E. Schwarz

High-performance microprocessors are protected against transient and early end-of-life failures using a variety of error detection and fault isolation technologies. Execution units can be protected with duplication, parity prediction, or residue checking. Residue checking has an advantage due to its small size. A modulus is selected based on the radix of the numbers being checked. A decimal floating-point unit handles two types of numbers in different bases: base 10 decimal numbers and base 2 integers. A residue checking system that makes it easy to check both base 2 and base 10 numbers is discussed. State-of-the-art designs currently in use are described, as well as a novel hybrid moduli 9 and 3 residue system. The checking systems for the decimal and binary floating-point units of some recent IBM microprocessors, including the Power6, Power7, z10, and z196, are detailed.

Citations: 26
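
The idea of residue checking in the abstract above can be sketched in a few lines of Python. This is only a software illustration of the general principle, under the assumption that a decimal value is checked modulo 9 and a binary value modulo 3; it does not reproduce the IBM hardware designs, and the function names are hypothetical.

```python
def residue_decimal(digits: str, modulus: int = 9) -> int:
    """Residue of a decimal digit string via its digit sum (10 ≡ 1 mod 9 and mod 3)."""
    return sum(int(d) for d in digits) % modulus


def residue_binary(value: int, modulus: int = 3) -> int:
    """Residue mod 3 of a non-negative binary integer via 2-bit groups (4 ≡ 1 mod 3)."""
    total = 0
    while value:
        total += value & 0b11   # add the lowest 2-bit group
        value >>= 2
    return total % modulus


def checked_multiply(a: int, b: int, product: int) -> bool:
    """Return True if the mod-9 residue of `product` matches the predicted residue."""
    predicted = (residue_decimal(str(a)) * residue_decimal(str(b))) % 9
    return predicted == residue_decimal(str(product))
```

Because 9 is a multiple of 3, a mod-9 residue can be reduced further to a mod-3 residue, so the two residue functions agree on the same value modulo 3. Note that a residue check misses any error that changes the result by a multiple of the modulus.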

The IBM zEnterprise-196 Decimal Floating-Point Accelerator

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.27

S. Carlough, Adam Collura, S. M. Müller, M. Kroener

Decimal floating-point arithmetic is widely used in commercial computing applications, such as financial transactions, where rounding errors prevent the use of binary floating-point operations. The revised IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008) defined standardized decimal floating-point (DFP) formats. As more software applications adopt the IEEE decimal floating-point standard, hardware accelerators that support it are becoming more prevalent. This paper describes the second-generation decimal floating-point accelerator implemented on the IBM zEnterprise-196 processor. The 4-cycle deep pipeline was designed to optimize the latency of fixed-point decimal operations while significantly improving the bandwidth of DFP operations. A detailed description of the unit and a comparison to previous implementations found in the literature are provided.

Citations: 47

On the Fixed-Point Accuracy Analysis and Optimization of FFT Units with CORDIC Multipliers

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.17

O. Sarbishei, K. Radecka

Fixed-point Fast Fourier Transform (FFT) units are widely used in digital communication systems. The twiddle multipliers required for realizing large FFTs are typically implemented with the Coordinate Rotation Digital Computer (CORDIC) algorithm to restrict memory requirements. Recent approaches that aim to optimize the bit-widths of FFT units while satisfying a given maximum bound on Mean-Square-Error (MSE) mostly focus on architectures with integer multipliers. They ignore the quantization error of coefficients, which prevents them from analyzing the exact error, defined as the difference between the fixed-point circuit and the reference floating-point model. This paper presents an efficient analysis of MSE as well as an optimization algorithm for CORDIC-based FFT units, which is applicable to other Linear-Time-Invariant (LTI) circuits as well.

Citations: 12
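
To make the role of a CORDIC twiddle multiplier in the abstract above more concrete, here is a minimal Python sketch of rotation-mode CORDIC. It uses floating point to show the shift-and-add structure only; a hardware unit would use fixed-point integers and shifts, which is where the quantization error studied in the paper arises. The iteration count and the name `cordic_rotate` are assumptions for illustration.

```python
import math

ITERATIONS = 16  # assumed number of micro-rotations
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Constant gain accumulated by the micro-rotations (about 1.6468).
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))


def cordic_rotate(x: float, y: float, angle: float):
    """Rotate (x, y) by `angle` radians, valid for |angle| up to about 1.74."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0          # rotation direction for this step
        step = 2.0 ** -i                      # a right shift by i in hardware
        x, y = x - d * y * step, y + d * x * step
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN                 # compensate the CORDIC gain


# Multiply the sample 1 + 0j by the twiddle factor exp(-2j*pi*k/N) with k=3, N=32.
re, im = cordic_rotate(1.0, 0.0, -2.0 * math.pi * 3 / 32)
```

The rotation replaces a stored complex coefficient and a full multiplier with a small table of arctangent values, which is why CORDIC is attractive when memory is constrained.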

Accelerating Computations on FPGA Carry Chains by Operand Compaction

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.22

Thomas B. Preußer, M. Zabel, R. Spallek

This work describes the carry-compact addition (CCA), a novel addition scheme that allows the acceleration of carry-chain computations on contemporary FPGA devices. While based on concepts known from carry-lookahead addition and from parallel prefix adders, their adaptation by the CCA takes the context of an FPGA as the implementation environment into account. FPGAs typically provide carry-chain structures to accelerate the simple ripple-carry addition (RCA). Rather than contrasting this scheme with the hierarchical addition approaches favored in hard-core VLSI designs, the CCA combines the benefits of both and uses hierarchical structures to shorten the critical path, which is still left on a core carry chain. In contrast to previous studies examining the asymptotically superior parallel prefix adders on FPGAs, the CCA is shown to outperform the standard RCA already for operand widths starting at 50 bits. Wider adders, such as those used in extended-precision floating-point units and in cryptographic applications, benefit from even larger speedups. The concrete mapping of the CCA achieved for current Xilinx and Altera architectures is described and shown to be very favorable, yielding a high speedup for a very modest investment of additional LUT resources.

Citations: 17

Efficient SIMD Arithmetic Modulo a Mersenne Number

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.37

Joppe W. Bos, T. Kleinjung, A. Lenstra, P. L. Montgomery

This paper describes carry-less arithmetic operations modulo an integer 2^M-1 in the thousand-bit range, targeted at single-instruction multiple-data (SIMD) platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles, a new record was set for the elliptic curve method for integer factorization.

Citations: 23
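
Reduction modulo a Mersenne number 2^M-1, the operation at the heart of the abstract above, needs no division because 2^M ≡ 1 (mod 2^M-1). The following Python sketch shows that folding step on ordinary big integers; it does not model the paper's carry-less SIMD representation, and the function names are hypothetical.

```python
def mod_mersenne(x: int, m: int) -> int:
    """Reduce a non-negative integer x modulo 2**m - 1 using shifts, masks and adds."""
    mask = (1 << m) - 1
    while x > mask:
        # Fold the high part onto the low part, since 2**m ≡ 1 (mod 2**m - 1).
        x = (x & mask) + (x >> m)
    return 0 if x == mask else x      # 2**m - 1 itself is congruent to 0


def mulmod_mersenne(a: int, b: int, m: int) -> int:
    """Multiply two residues and reduce the product modulo 2**m - 1."""
    return mod_mersenne(a * b, m)
```

For example, `mulmod_mersenne(a, b, 1279)` reduces a product modulo 2^1279-1, a modulus in the thousand-bit range.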

Teraflop FPGA Design

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.32

M. Langhammer

User requirements for signal processing have increased in line with, or faster than, the increase in FPGA resources and capability. Many current signal processing algorithms require floating point, especially for military applications such as radar. Also, the increasing system complexity of these designs necessitates increased designer productivity, and floating point allows an easier implementation of the system model than the fixed-point arithmetic that FPGA devices have traditionally been architected for. This article will review devices and methods for achieving consistent high-performance system implementations in floating point. Single-device designs at over 200 GFLOPs at the 40nm node, and approaching 1 Teraflop at 28nm, will be described.

Citations: 1

Automatic Generation of Code for the Evaluation of Constant Expressions at Any Precision with a Guaranteed Error Bound

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.38

S. Chevillard

The evaluation of special functions often involves the evaluation of numerical constants. When the precision of the evaluation is known in advance (e.g., when developing libms), these constants are simply precomputed once and for all. In contrast, when the precision is dynamically chosen by the user (e.g., in multiple-precision libraries), the constants must be evaluated on the fly at the required precision and with a rigorous error bound. Often, such constants are easily obtained by means of formulas involving simple numbers and functions. In principle, it is not difficult to write multiple-precision code for evaluating such formulas with a rigorous round-off analysis: one only has to study how round-off errors propagate through subexpressions. However, this work is painful and error-prone, and it is difficult for a human being to be perfectly rigorous in this process. Moreover, the task quickly becomes impractical when the size of the formula grows. In this article, we present an algorithm that takes a constant formula as input and automatically produces code for evaluating it in arbitrary precision with a rigorous error bound. It has been implemented in the Sollya free software tool, and its behavior is illustrated on several examples.

Citations: 0

Accelerating Large-Scale HPC Applications Using FPGAs

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.34

R. Dimond, S. Racanière, O. Pell

Field Programmable Gate Arrays (FPGAs) are conventionally considered 'glue logic'. However, modern FPGAs are extremely competitive compared to state-of-the-art CPUs for commercial HPC workloads, such as those found in Oil and Gas and Finance. For example, an FPGA-accelerated system can be 31-37 times faster than an equivalently sized conventional machine, and consume 1/39 of the power. The key to achieving the best performance in FPGA accelerators, while maintaining correctness, is optimization of arithmetic units and data types to suit the range and precision required at each point in the computation. The flexibility of the FPGA to implement non-standard arithmetic, combined with a data-flow programming model that instantiates a separate unit for each arithmetic operator in the code, provides a wide design space. As such, FPGA computing offers significant opportunity for arithmetic research into 'large scale' HPC applications, where there is an opportunity to move away from standard IEEE formats, either to improve precision compared to the CPU version or to increase speed.

Citations: 26
