2011 IEEE 20th Symposium on Computer Arithmetic: Latest Articles

The Arithmetic Operators You Will Never See in a Microprocessor

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.33

F. D. Dinechin

It has been shown that FPGAs could outperform high-end microprocessors even on floating-point computations, thanks to massive parallelism. Too often, however, such studies re-implement in the FPGA the operators present in a processor. An FPGA can do much better: it can accommodate hardware operators that would make no economical sense in a general-purpose processor, and it can tailor them just right to the needs of the application. This talk tries to survey this idea systematically, discussing its potential, exhibiting some exotic (but useful) operators developed in the FloPoCo project, and listing some of the challenges ahead.

Citations: 6

Radix-16 Combined Division and Square Root Unit

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.30

A. Nannarelli

Division and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented, mostly for radices 2 and 4. Here, we present a combined radix-16 unit obtained by overlapping two radix-4 result digit selection functions, as is normally done for division-only units. The latency of the unit is reduced by retiming, and low-power methods are applied as well. The proposed unit is compared to a radix-4 combined division/square root unit, and to a radix-16 unit, obtained by cascading two radix-4 stages, which is similar to the one implemented in a state-of-the-art processor.
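
As a rough software illustration of the digit-recurrence idea, the sketch below produces one radix-16 quotient digit per iteration by chaining two radix-4 steps. It is not the unit described in the abstract: it covers division only, uses exact integer digit selection instead of a redundant selection function, and the names `radix4_step` and `divide_radix16` are hypothetical.

```python
def radix4_step(w, d):
    """One radix-4 recurrence step: shift the residual, pick a digit, subtract.

    Assumes 0 <= w < d on entry, so the selected digit is in 0..3.
    Digit selection here is an exact integer division (an assumption made
    for clarity; hardware uses a small selection function instead).
    """
    w *= 4
    digit = w // d
    return digit, w - digit * d


def divide_radix16(x, d, iterations):
    """Return (q, w) with x * 16**iterations == q * d + w and 0 <= w < d.

    Requires integers with 0 <= x < d. Each loop iteration yields one
    radix-16 digit built from two radix-4 digits.
    """
    q, w = 0, x
    for _ in range(iterations):
        hi, w = radix4_step(w, d)
        lo, w = radix4_step(w, d)
        q = 16 * q + 4 * hi + lo
    return q, w
```

Calling `divide_radix16(1, 3, 4)` computes four radix-16 digits of 1/3; the invariant in the docstring can be checked directly on the returned pair.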

Citations: 13

Towards a Quaternion Complex Logarithmic Number System

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.14

M. Arnold, J. Cowles, Vassilis Paliouras, I. Kouretas

The well-known generalization of real to complex arithmetic (two reals) extends further to more obscure quaternion arithmetic (four reals), which has applications in signal processing, aerospace, graphics and virtual reality. Quaternion multiplication implements 3D rotation, but is expensive (usually 16 floating-point multiplications and 12 additions). This paper proposes an alternative quaternion representation using logarithms to reduce multiplication cost. The real Logarithmic Number System (LNS) allows fast and inexpensive multiplication and division in embedded and FPGA-based systems. Recent advances in the Complex LNS (CLNS) [5] have made fast log-polar complex representation affordable. Although the quaternion logarithm function is also well-defined, it is not useful to simplify multiplication (in the same way real and complex logarithms are) because quaternion multiplication is not commutative but quaternion addition is. To overcome this, we propose a novel Quaternion Complex (QCLNS) representation using a pair of CLNS numbers. This representation implements quaternion multiplication using only the theoretical minimum [11], [15] of 8 LNS multipliers (i.e., fixed-point adders) and two CLNS adders. Because CLNS numbers are more compact than ordinary rectangular complex representation, single-precision QCLNS occupies 10.9 percent less memory than conventional quaternion representation. Extrapolating conventional LNS and floating-point synthesis data from Fu et al. [12], QCLNS saves on average 10 percent of FPGA resources for precisions between 13 and 45 bits.
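
To make the operation counts concrete, here is a minimal sketch of the conventional Hamilton product (16 multiplications and 12 additions or subtractions) together with a toy real LNS in which multiplication reduces to adding logarithms. It does not implement the QCLNS representation from the paper; the function names and the `(is_negative, log2 magnitude)` encoding are assumptions chosen for illustration.

```python
import math


def quat_mul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    # Four multiplications and three additions/subtractions per component.
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def lns_encode(x):
    """Encode a nonzero real as (is_negative, log2 of its magnitude)."""
    return (x < 0, math.log2(abs(x)))


def lns_mul(u, v):
    """LNS multiplication: XOR the signs, add the logarithms."""
    return (u[0] != v[0], u[1] + v[1])


def lns_decode(u):
    """Convert an LNS pair back to an ordinary float."""
    magnitude = 2.0 ** u[1]
    return -magnitude if u[0] else magnitude
```

With the basis elements i = (0, 1, 0, 0) and j = (0, 0, 1, 0), `quat_mul(i, j)` and `quat_mul(j, i)` differ in sign, which illustrates the non-commutativity the abstract refers to.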

Citations: 8

Automatic Generation of Fast and Certified Code for Polynomial Evaluation

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.39

C. Mouilleron, G. Revy

Designing an efficient floating-point implementation of a function based on polynomial evaluation requires being able to find an accurate enough evaluation code that exploits the target architecture features as fully as possible. This article introduces CGPE, a tool dealing with the generation of fast and certified codes for the evaluation of bivariate polynomials. First we discuss the issue underlying the evaluation scheme combinatorics before giving an overview of the CGPE tool. The approach we propose consists of two steps: the generation of evaluation schemes by using some heuristics so as to quickly find some with low latency, and the selection that mainly consists in automatically checking their scheduling on the given target and validating their accuracy. Then, we present on-going development and ideas for possible improvements of the whole process. Finally, we illustrate the use of CGPE on some examples, and show how it allows us to generate fast and certified codes in a few seconds and thus to reduce the development time of libms like FLIP.
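
The phrase "evaluation scheme" can be illustrated with two classic parenthesizations of the same polynomial. The sketch below is not CGPE output and handles only the univariate case (the tool targets bivariate polynomials); `horner` and `estrin` are illustrative names. Horner's rule has a long sequential dependency chain, while Estrin's scheme exposes independent operations that a target with instruction-level parallelism can overlap.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with a fully sequential scheme."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by pairing terms level by level.

    All pairings on one level are independent of each other, which
    shortens the critical path compared with Horner's rule.
    Assumes coeffs is non-empty.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power
    return terms[0]
```

In floating-point arithmetic the two schemes generally round differently, which is why a tool in this space has to validate accuracy as well as latency.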

Citations: 27

Short Division of Long Integers

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.11

David Harvey, P. Zimmermann

We consider the problem of short division -- i.e., approximate quotient -- of multiple-precision integers. We present ready-to-implement algorithms that yield an approximation of the quotient, with tight and rigorous error bounds. We exhibit speedups of up to 30% with respect to GMP division with remainder, and up to 10% with respect to GMP short division, with room for further improvements. This work enables one to implement fast correctly rounded division routines in multiple-precision software tools.

Citations: 2

Radix-8 Digit-by-Rounding: Achieving High-Performance Reciprocals, Square Roots, and Reciprocal Square Roots

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.28

J. A. Butts, P. T. P. Tang, R. Dror, D. Shaw

We describe a high-performance digit-recurrence algorithm for computing exactly rounded reciprocals, square roots, and reciprocal square roots in hardware at a rate of three result bits -- one radix-8 digit -- per recurrence iteration. To achieve a single-cycle recurrence at a short cycle time, we adapted the digit-by-rounding algorithm, which is normally applied at much higher radices, for efficient operation at radix 8. Using this approach avoids in the recurrence step the lookup table required by SRT -- the usual algorithm used for hardware digit recurrences. The increasing access latency of this table, the size of which grows superlinearly in the radix, limits high-frequency SRT implementations to radix 4 or lower. We also developed a series of novel optimizations focused on further reducing the critical path through the recurrence. We propose, for example, decreasing data path widths to a point where erroneous results sometimes occur and then correcting these errors off the critical path. We present a specific implementation that computes any of these functions to 31 bits of precision in 13 cycles. Our implementation achieves a cycle time only 11% longer than the best reported SRT design for the same functions, yet delivers results in five fewer cycles. Finally, we show that even at lower radices, a digit-by-rounding design is likely to have a shorter critical path than one using SRT at the same radix.

Citations: 4

A 1.5 Ghz VLIW DSP CPU with Integrated Floating Point and Fixed Point Instructions in 40 nm CMOS

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.20

T. Anderson, Duc Bui, S. Moharil, Soujanya Narnur, Mujibur Rahman, A. Lell, Eric Biscondi, A. Shrivastava, P. Dent, Mingjian Yan, Hasan Mahmood

A next-generation VLIW DSP Central Processing Unit (CPU) which has an integrated fixed point and floating point Instruction Set Architecture (ISA) is presented. It is designed to meet a 1.5 GHz core clock frequency in a 40nm process with aggressive area and power goals. In this paper, the benchmarking process and benefits of newly defined instructions such as complex matrix multiply are explained. Also, the CPU data path is described in detail, highlighting several novel micro-architecture features. Finally, our design methodology as well as verification methodology to ensure functional correctness utilizing formal equivalent verification is described.

Citations: 9

Composite Iterative Algorithm and Architecture for q-th Root Calculation

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.16

Álvaro Vázquez, J. Bruguera

An algorithm for the q-th root extraction, q being any integer, is presented in this paper. The algorithm is based on an optimized implementation of X^{1/q} by a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis and two architectures are proposed, for low precision q and for higher precision q. The execution time and hardware requirements are estimated for single precision floating-point computations for several radices; this helps to determine which radices result in the most efficient implementations. The architectures proposed improve the features of other architectures for q-th root extraction.
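
The four-step decomposition rests on the identity X^{1/q} = exp((1/q) * log X). The sketch below mirrors that sequence with ordinary floating-point library calls; it says nothing about the digit-recurrence or on-line hardware algorithms themselves, and the choice of base 2 and the name `qth_root` are assumptions made for illustration.

```python
import math


def qth_root(x, q):
    """Compute x**(1/q) for x > 0 and nonzero integer q via log and exp."""
    if x <= 0 or q == 0:
        raise ValueError("requires x > 0 and q != 0")
    reciprocal = 1.0 / q          # step 1: reciprocal of q
    logarithm = math.log2(x)      # step 2: logarithm of the operand
    product = reciprocal * logarithm  # step 3: multiplication
    return 2.0 ** product         # step 4: exponential


print(qth_root(27.0, 3))
```

In software the steps run one after another; the abstract's point is that a hardware implementation can run them in parallel or overlap them.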

Citations: 12

High Degree Toom'n'Half for Balanced and Unbalanced Multiplication

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.12

Marco Bodrato

Some hints and tricks to automatically obtain high degree Toom-Cook implementations, i.e. functions for integer or polynomial multiplication with a reduced complexity. The described method generates quite an efficient sequence of operations and the memory footprint is kept low by using a new strategy: mixing evaluation, interpolation and recomposition phases. It is possible to automatise the whole procedure obtaining a general Toom-n function, and to extend the method to polynomials in any characteristic except two.
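
For readers unfamiliar with the three phases named above, here is a minimal single-level Toom-3 sketch for nonnegative integers. It keeps evaluation, pointwise multiplication, interpolation and recomposition strictly separate, so it does not show the paper's memory-saving strategy of mixing the phases. The function name `toom3`, the evaluation points (0, 1, -1, -2, infinity) and the interpolation order are choices made for this example.

```python
def toom3(a, b, k):
    """Multiply nonnegative integers a and b using one level of Toom-3.

    Each operand is split into three parts in base 2**k, so the product
    needs five pointwise multiplications instead of nine.
    """
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)

    def evaluate(c0, c1, c2):
        # Values of c0 + c1*x + c2*x**2 at x = 0, 1, -1, -2, infinity.
        return (c0, c0 + c1 + c2, c0 - c1 + c2, c0 - 2 * c1 + 4 * c2, c2)

    # Pointwise products (built-in multiplication stands in for recursion).
    v0, v1, vm1, vm2, vinf = (
        x * y for x, y in zip(evaluate(a0, a1, a2), evaluate(b0, b1, b2))
    )

    # Interpolation: recover the five coefficients of the product polynomial.
    r0, r4 = v0, vinf
    r3 = (vm2 - v1) // 3          # exact division
    r1 = (v1 - vm1) // 2          # exact division
    r2 = vm1 - v0
    r3 = (r2 - r3) // 2 + 2 * vinf
    r2 = r2 + r1 - r4
    r1 = r1 - r3

    # Recomposition: evaluate the product polynomial at x = 2**k.
    return (r0 + (r1 << k) + (r2 << (2 * k))
            + (r3 << (3 * k)) + (r4 << (4 * k)))
```

The divisions by 2 and 3 are exact for integer inputs, which is also why the method needs care in small characteristics.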

Citations: 4

Flocq: A Unified Library for Proving Floating-Point Algorithms in Coq

2011 IEEE 20th Symposium on Computer Arithmetic

Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.40

S. Boldo, G. Melquiond

Several formalizations of floating-point arithmetic have been designed for the Coq system, a generic proof assistant. Their different purposes have favored some specific applications: program verification, high-level properties, automation. Based on our experience using and/or developing these libraries, we have built a new system that is meant to encompass the other ones in a unified framework. It offers a multi-radix and multi-precision formalization for various floating- and fixed-point formats. This fresh setting has been the occasion for reevaluating known properties and generalizing them. This paper presents design decisions and examples of theorems from the Flocq system: a library easy to use, suitable for automation yet high-level and generic.

Citations: 134
