17th IEEE Symposium on Computer Arithmetic (ARITH'05): Latest Papers

A hardware algorithm for integer division

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.6

N. Takagi, Shunsuke Kadowaki, K. Takagi

A hardware algorithm for integer division is proposed. It is based on the digit-recurrence, non-restoring division algorithm. Fast computation is achieved by using the radix-2 signed-digit representation. The algorithm does not require normalization of the divisor and hence requires neither area-consuming leading-one (or leading-zero) detection nor variable-amount shifts. A combinational (unfolded) implementation of the algorithm yields a regularly structured array divider that can be pipelined to increase throughput. A sequential implementation yields a compact divider.
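
A minimal behavioral sketch of radix-2 non-restoring integer division follows. It assumes an unsigned `n_bits`-bit dividend and any positive divisor. At each step the quotient digit is chosen from {-1, +1} by the sign of the partial remainder alone. The divisor is shifted only by amounts fixed by the step index, never by a data-dependent amount, so no normalization is needed. Unlike the proposed hardware, the partial remainder here is an ordinary integer rather than a radix-2 signed-digit (carry-free) value. The function name `nonrestoring_divide` is hypothetical.

```python
def nonrestoring_divide(dividend: int, divisor: int, n_bits: int) -> tuple[int, int]:
    """Radix-2 non-restoring division of an unsigned n_bits-bit dividend.

    Quotient digits are taken from {-1, +1}; the divisor is never normalized.
    Invariant before step j: dividend == q*divisor + r and
    -divisor * 2**(j+1) <= r < divisor * 2**(j+1).
    """
    if divisor <= 0 or not 0 <= dividend < (1 << n_bits):
        raise ValueError("expects 0 <= dividend < 2**n_bits and divisor > 0")
    r = dividend
    q = 0
    for j in reversed(range(n_bits)):
        # The digit is selected from the sign of the partial remainder only.
        digit = 1 if r >= 0 else -1
        r -= digit * (divisor << j)
        q += digit * (1 << j)  # accumulate the {-1, +1} digit string as an integer
    # After the last step -divisor <= r < divisor; one correction makes r >= 0.
    if r < 0:
        r += divisor
        q -= 1
    return q, r
```

For example, `nonrestoring_divide(200, 7, 8)` returns `(28, 4)`. Restricting the quotient digits to {-1, +1} is what allows each digit to be selected from the remainder sign alone. The single final correction maps the result back to the usual non-negative remainder.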

Citations: 31

Quasi-pipelined hash circuits

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.36

Marco Macchetti, L. Dadda

Hash functions are an important cryptographic primitive. They are used to obtain a fixed-size fingerprint, or hash value, of an arbitrarily long message. We focus on the class of dedicated hash functions, whose general construction is presented; the particular arrangement of sequential and combinational units makes applying pipelining techniques to these constructions nontrivial. We formalize an optimization technique called quasi-pipelining, whose goal is to optimize the critical path and thus increase the clock frequency of dedicated hardware implementations. The SHA-2 algorithm has previously been examined by Dadda et al. with specific versions of quasi-pipelining; here a full generalization of the technique is presented, along with its application to the SHA-1 algorithm. Quasi-pipelining could also be applied to future hash algorithms, provided they are designed along the same lines as the SHA family.
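
To make the iterated-round structure of a dedicated hash function concrete, here is a plain reference SHA-1 following the Secure Hash Standard. It is not quasi-pipelined. Each of the 80 rounds updates five 32-bit state words. The multi-operand addition that produces `temp` is, roughly, the long combinational path in a naive one-round-per-cycle hardware design, and that is the kind of path quasi-pipelining targets. The paper's restructuring itself is not shown, and the function names are illustrative.

```python
import struct

MASK32 = 0xFFFFFFFF

def _rotl(x: int, n: int) -> int:
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def sha1_hex(message: bytes) -> str:
    """Reference SHA-1: padding, message schedule and 80-round compression."""
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    bit_len = len(message) * 8
    message += b"\x80"
    message += b"\x00" * ((56 - len(message) % 64) % 64)
    message += struct.pack(">Q", bit_len)
    for off in range(0, len(message), 64):
        # Message schedule: 16 input words expanded to 80.
        w = list(struct.unpack(">16I", message[off:off + 64]))
        for t in range(16, 80):
            w.append(_rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h
        for t in range(80):
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            # Five-operand addition: the long path inside each round.
            temp = (_rotl(a, 5) + f + e + k + w[t]) & MASK32
            a, b, c, d, e = temp, a, _rotl(b, 30), c, d
        h = [(x + y) & MASK32 for x, y in zip(h, (a, b, c, d, e))]
    return "".join(f"{x:08x}" for x in h)
```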

Citations: 39

Synthesis of saturating counters using traditional and non-traditional basic counters

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.42

Zhaojun Wo, I. Koren

Saturating counters are a newly defined class of generalized parallel counters that give the exact number of inputs equal to 1 only if this number is below a given threshold. Such counters are useful, for example, in self-test and repair units for embedded memories. This paper defines saturating counters for arbitrary threshold values and presents several alternatives for their implementation. The delay and area of the proposed design alternatives are then estimated using a 0.25 μm cell library. Finally, we study the behavior of saturating counters as the threshold approaches the number of input bits, i.e., the special case of non-saturating parallel counters.
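
The following is a behavioral model, not a gate-level design. The exact count of ones is reported only while it is below the threshold; otherwise the output is clamped and flagged. The `(count, saturated)` output encoding and the name `saturating_count` are assumptions for illustration, since the abstract does not specify what the counter outputs once the threshold is reached.

```python
def saturating_count(bits, threshold: int) -> tuple[int, bool]:
    """Behavioral saturating counter.

    Returns (count, saturated). The count is exact only while the number of
    ones is below `threshold`; otherwise it is clamped to `threshold`.
    """
    ones = sum(1 for b in bits if b)
    if ones < threshold:
        return ones, False
    return threshold, True
```

Setting `threshold` above `len(bits)` means the counter never saturates. That corresponds to the special case of an ordinary (non-saturating) parallel counter.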

Citations: 0

Some functions computable with a fused-mac

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.39

S. Boldo, J. Muller

The fused multiply-accumulate instruction (fused-mac), available on some current processors such as the PowerPC and the Itanium, eases some calculations. We give examples of floating-point functions (such as ulp(x) or Nextafter(x, y)) and of some useful tests that are easily computable using a fused-mac. We then show that, with rounding to nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers, and we give an algorithm that computes that error.
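
A small sketch of fused-mac behavior follows. It emulates the fused-mac in double precision with exact rational arithmetic: the product and sum are computed exactly with `fractions.Fraction`, then rounded once to the nearest double. It also shows one classic use, recovering the exact rounding error of a product. The paper's algorithm for the error of the fused-mac itself is not reproduced here. `fma_error_exact` only computes that error with rationals, for checking. The helper names are hypothetical; on Python 3.13 and later, `math.fma` could replace the emulated `fma`.

```python
from fractions import Fraction

def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c with a single rounding to nearest."""
    # Fraction arithmetic is exact; float() rounds the exact result once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def product_error(a: float, b: float) -> tuple[float, float]:
    """Return p = fl(a*b) and the exact error a*b - p, via one fused-mac.

    Assumes no underflow or overflow occurs.
    """
    p = a * b
    return p, fma(a, b, -p)

def fma_error_exact(a: float, b: float, c: float) -> Fraction:
    """Exact error a*b + c - fma(a, b, c), computed with rationals for checking."""
    r = fma(a, b, c)
    return Fraction(a) * Fraction(b) + Fraction(c) - Fraction(r)
```

With `a = 1 + 2**-30`, `product_error(a, a)` returns `(1 + 2**-29, 2**-60)`: the rounded product and the low-order bits it lost.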

Citations: 31

The vector floating-point unit in a synergistic processor element of a CELL processor

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.45

S. M. Müller, C. Jacobi, H. Oh, K. Tran, S. Cottier, B. Michael, H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano, T. Machida, S. Dhong

The floating-point unit (FPU) in the synergistic processor element of the first-generation multi-core CELL processor is described. The FPU supports 4-way SIMD single-precision and integer operations and 2-way SIMD double-precision operations. The design required high frequency, low latency, and power and area efficiency, with primary application to multimedia streaming workloads such as 3D graphics. The FPU has three different latencies and is optimized for the performance-critical single-precision FMA operations, which execute with a 6-cycle latency at an 11 FO4 cycle time; this latency includes the global forwarding of the result. These challenging performance, power, and area goals were achieved through co-design of architecture and implementation, with optimizations at all levels of the design. This paper focuses on the logical and algorithmic aspects of the FPU developed to achieve these goals.

Citations: 65

Low latency digit-recurrence reciprocal and square-root reciprocal algorithm and architecture

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.29

E. Antelo, T. Lang, P. Montuschi, A. Nannarelli

The reciprocal and square-root reciprocal operations are important in several applications. For these operations, we present algorithms that combine a digit-by-digit module and one iteration of a quadratic-convergence approximation. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this way, both parts execute in an overlapped manner, so that the total number of cycles is about half of the number that would be required by the digit-by-digit part alone. Because of the approximation, correct rounding of the result cannot be obtained directly in all cases; we propose a variable-time implementation that produces the correctly rounded result with a small average overhead. Radix-4 implementations are described and have been synthesized. They achieve the same cycle time as the standard digit-by-digit implementation, resulting in a speed-up of about 2 and, because of the approximation part, the area factor is also about 2. We also show a combined implementation for both operations that has essentially the same complexity as that for square-root reciprocal alone.
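
The key property behind combining a digit-by-digit module with one quadratic-convergence iteration is that the iteration squares the error term. A small exact-arithmetic sketch follows. It assumes the Newton-Raphson form y1 = y0 * (2 - x*y0) for the reciprocal. `truncated_reciprocal` is a hypothetical stand-in for the digit-by-digit module. The paper's overlapped radix-4 digit-recurrence realization, the square-root reciprocal variant and the rounding step are not modeled.

```python
from fractions import Fraction

def truncated_reciprocal(x: Fraction, k: int) -> Fraction:
    """Stand-in for a digit-by-digit unit: 1/x truncated to k fractional bits (x > 0)."""
    return Fraction((2**k * x.denominator) // x.numerator, 2**k)

def newton_step(x: Fraction, y0: Fraction) -> Fraction:
    """One quadratic-convergence (Newton-Raphson) step for 1/x."""
    return y0 * (2 - x * y0)
```

Because 1 - x*y1 = (1 - x*y0)^2 holds exactly, a y0 accurate to about k bits gives a y1 accurate to about 2k bits. This is the intuition for why roughly half as many digit-by-digit steps are needed.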

Citations: 16

Parallel Montgomery multiplication in GF(2^k) using trinomial residue arithmetic

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.34

J. Bajard, L. Imbert, G. Jullien

We propose the first general multiplication algorithm in GF(2^k) with a subquadratic area complexity of O(k^(8/5)) = O(k^1.6). Using the Chinese remainder theorem, we represent the elements of GF(2^k), i.e., the polynomials in GF(2)[X] of degree at most k-1, by their remainders modulo a set of n pairwise coprime trinomials T_1, ..., T_n of degree d such that nd ≥ k. Our algorithm is based on Montgomery's multiplication applied to the ring formed by the direct product of the trinomials.
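
A minimal sketch of the trinomial residue representation follows, with polynomials over GF(2) stored as integer bit masks. It assumes n = 2 and d = 4, using the two irreducible (hence coprime) trinomials X^4 + X + 1 and X^4 + X^3 + 1. These are hypothetical choices, not parameters from the paper. Only the Chinese-remainder representation and residue-wise multiplication are shown. Montgomery reduction modulo the field polynomial and the paper's parallel structure are not modeled. Operands of degree at most 3 have products of degree at most 6 < nd = 8, so reconstruction recovers the full product.

```python
TRINOMIALS = (0b10011, 0b11001)  # X^4 + X + 1, X^4 + X^3 + 1 (assumed moduli)

def pdeg(p: int) -> int:
    """Degree of a GF(2)[X] polynomial stored as a bit mask (-1 for zero)."""
    return p.bit_length() - 1

def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[X]) polynomial product."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pdivmod(a: int, m: int) -> tuple[int, int]:
    """Polynomial division over GF(2): returns (quotient, remainder)."""
    q, dm = 0, pdeg(m)
    while pdeg(a) >= dm:
        s = pdeg(a) - dm
        q ^= 1 << s
        a ^= m << s
    return q, a

def pinv(a: int, m: int) -> int:
    """Inverse of a modulo m in GF(2)[X] via the extended Euclidean algorithm."""
    r0, r1 = m, pdivmod(a, m)[1]
    s0, s1 = 0, 1
    while r1:
        q, r = pdivmod(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, s0 ^ clmul(q, s1)
    if r0 != 1:
        raise ValueError("a is not invertible modulo m")
    return pdivmod(s0, m)[1]

MODULUS = clmul(*TRINOMIALS)  # product of the trinomials, degree n*d = 8

def to_residues(a: int) -> list[int]:
    return [pdivmod(a, t)[1] for t in TRINOMIALS]

def residue_mul(ra: list[int], rb: list[int]) -> list[int]:
    """Independent (parallelizable) multiplication in each residue channel."""
    return [pdivmod(clmul(x, y), t)[1] for x, y, t in zip(ra, rb, TRINOMIALS)]

def from_residues(residues: list[int]) -> int:
    """Chinese remainder reconstruction modulo MODULUS."""
    x = 0
    for r, t in zip(residues, TRINOMIALS):
        m_i = pdivmod(MODULUS, t)[0]  # product of the other moduli
        x ^= clmul(clmul(r, m_i), pinv(m_i, t))
    return pdivmod(x, MODULUS)[1]
```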

Citations: 34

A linear-system operator based scheme for evaluation of multinomials

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.8

P. Adharapurapu, M. Ercegovac

We present a radix-2 online computational scheme for evaluating multinomials in a fixed-point number representation system. Its main advantage is that it can adapt to any evaluation graph representing the multinomial. Evaluation graphs are efficient representations of multinomials in a factored form. The proposed scheme maps subgraphs of the evaluation graph using linear-system operators. These operators transform the expressions represented by the subgraphs into systems of linear equations. The linear equations are then solved in an online, most-significant-digit-first fashion. The scheme produces, after an initial delay, one output digit per iteration for inputs within range. The iteration time is equal to the sum of the delays of a redundant adder, multiplexer, register and a selection unit and is independent of the size of the multinomial and the precision of the inputs/outputs. The initial delay is proportional to the diameter of the evaluation graph and the maximum number of children of any addition node in the graph. The proposed method lends itself to implementation using simple, highly regular hardware with serial interconnections between modules.
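
To illustrate the idea of turning an expression into a linear system that is solved most-significant-digit first, here is a sketch of the classical E-method specialized to a single polynomial. This is a much simpler case than an arbitrary evaluation graph. The polynomial p(x) = a_0 + a_1 x + ... + a_n x^n becomes the bidiagonal system y_i - x*y_(i+1) = a_i, so y_0 = p(x). Each iteration computes one signed digit of every y_i in parallel using only a shift, a selection of ±x or 0 (a multiplexer in hardware), an addition, and digit selection. The bounds |x| ≤ 1/4 and |a_i| ≤ 3/4 are assumptions that keep every residual within [-3/4, 3/4] for this particular selection rule. They are not the paper's convergence conditions, and the function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def select_digit(v: Fraction) -> int:
    """Round v to the nearest radix-2 signed digit in {-1, 0, 1}."""
    if v >= HALF:
        return 1
    if v <= -HALF:
        return -1
    return 0

def e_method_poly(coeffs, x, n_digits: int):
    """Digit-serial, MSD-first evaluation of p(x) = sum(coeffs[i] * x**i).

    Assumes |x| <= 1/4 and |coeffs[i]| <= 3/4. Returns the digits of y_0 = p(x)
    and their value; the error is below 2**-n_digits under these bounds.
    """
    x = Fraction(x)
    n = len(coeffs)
    w = [Fraction(c) for c in coeffs]  # residuals start at the coefficients
    out = []
    for _ in range(n_digits):
        d = [select_digit(2 * wi) for wi in w] + [0]  # y_(n+1) = 0
        # Residual update: w_i <- 2*w_i - d_i + x * d_(i+1), all i in parallel.
        w = [2 * w[i] - d[i] + x * d[i + 1] for i in range(n)]
        out.append(d[0])
    value = sum(Fraction(dig, 2 ** (j + 1)) for j, dig in enumerate(out))
    return out, value
```

For example, with coefficients 1/2, -1/4, 3/8 and x = 1/4, the exact value is p(x) = 59/128, and 24 output digits approximate it to within 2^-24.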

Citations: 3

Single precision reciprocals by multipartite table lookup

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.37

Peter Kornerup, D. Matula

We develop the foundations for confirming the monotonicity of a multi-term reciprocal function approximation. We introduce the concept of operand recoding to improve the accuracy of multipartite approximation. The results are applied to a proposed four-partite reciprocal implementation with a total table size of about 27 Kbytes, yielding a reciprocal instruction for the IEEE standard single-precision format (24 bits) that is a one-ulp monotonic reciprocal.
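
For intuition about table-lookup approximation, here is a sketch of the plain bipartite (two-table) method for 1/x on [1, 2). It is a simpler relative of the multipartite approach, with no operand recoding and no monotonicity check. The 12-bit input is split into assumed 4/4/4-bit fields x0, x1, x2. One table stores 1/x at (x0, x1) plus the midpoint of x2. The other stores a first-order slope correction indexed by (x0, x2). Entries are left as unrounded floats. Field widths, table layout and function names are all hypothetical.

```python
def build_bipartite_tables(n0: int = 4, n1: int = 4, n2: int = 4):
    """Tables for f(x) = 1/x with x = 1 + i / 2**(n0+n1+n2) in [1, 2)."""
    w = n0 + n1 + n2
    mid1 = (2**n1 - 1) / 2 / 2**(n0 + n1)  # centre of the x1 field's range
    mid2 = (2**n2 - 1) / 2 / 2**w          # centre of the x2 field's range
    t1, t2 = {}, {}
    for i0 in range(2**n0):
        x0 = 1 + i0 / 2**n0
        for i1 in range(2**n1):
            t1[i0, i1] = 1 / (x0 + i1 / 2**(n0 + n1) + mid2)
        slope = -1 / (x0 + mid1) ** 2      # f'(x) = -1/x^2 at the segment centre
        for i2 in range(2**n2):
            t2[i0, i2] = slope * (i2 / 2**w - mid2)
    return t1, t2

def bipartite_reciprocal(i: int, t1, t2, n0: int = 4, n1: int = 4, n2: int = 4) -> float:
    """Approximate 1/x for x = 1 + i / 2**(n0+n1+n2): two lookups and one addition."""
    i0 = i >> (n1 + n2)
    i1 = (i >> n2) & (2**n1 - 1)
    i2 = i & (2**n2 - 1)
    return t1[i0, i1] + t2[i0, i2]
```

With these widths the two tables hold 256 entries each, versus 4096 for a direct table, and the approximation error over all inputs stays below 2^-12.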

Citations: 6

Efficient function approximation using truncated multipliers and squarers

17th IEEE Symposium on Computer Arithmetic (ARITH'05)

Pub Date : 2005-06-27 DOI: 10.1109/ARITH.2005.18

E. G. Walters, M. Schulte

This paper presents a technique for designing linear and quadratic interpolators for function approximation using truncated multipliers and squarers. Initial coefficient values are found using a Chebyshev series approximation and then adjusted through exhaustive simulation to minimize the maximum absolute error of the interpolator output. This technique is suitable for any function and any precision up to 24 bits (IEEE single precision). As an example, designs for linear and quadratic interpolators that implement the reciprocal function, f(x) = 1/x, are presented and analyzed. We show that a 24-bit truncated reciprocal quadratic interpolator with a design specification of ±1 ulp error requires 24.1% fewer partial products to implement than a comparable standard interpolator with the same error specification.
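
A minimal model of a truncated multiplier follows. It keeps only the partial-product bits whose column index is at least `t` and reports the worst-case error from dropping the rest. The error is always non-negative, since only bits are removed. The paper compensates for this kind of error by adjusting the interpolator coefficients through exhaustive simulation, which is not modeled here. The function names and the unsigned n-bit operand format are assumptions.

```python
def truncated_mul(a: int, b: int, n: int, t: int) -> int:
    """n-bit x n-bit unsigned product keeping only partial-product columns >= t."""
    acc = 0
    for i in range(n):
        if (a >> i) & 1:
            for j in range(max(0, t - i), n):  # only columns i + j >= t
                if (b >> j) & 1:
                    acc += 1 << (i + j)
    return acc

def max_truncation_error(n: int, t: int) -> int:
    """Worst case: every dropped partial-product bit equals 1.

    Column c of an n x n array holds min(c + 1, 2n - 1 - c) bits.
    """
    return sum(min(c + 1, 2 * n - 1 - c) << c for c in range(t))
```

For 6-bit operands with the four lowest columns dropped (`t = 4`), the product can be low by at most 1 + 2*2 + 3*4 + 4*8 = 49. That bound is reached when both operands are all ones.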

Citations: 62
