Split-Path Fused Floating Point Multiply Accumulate (FPMAC)

2013 IEEE 21st Symposium on Computer Arithmetic. Pub Date: 2013-04-07. DOI: 10.1109/ARITH.2013.32

S. Srinivasan, Ketan Bhudiya, R. Ramanarayanan, P. Babu, Tiju Jacob, S. Mathew, R. Krishnamurthy, V. Erraguntla


Citations: 13

Abstract

The floating-point multiply-accumulate (FPMAC) unit is the backbone of modern processors and a key circuit determining the frequency, power, and area of microprocessors. FPMAC units are used extensively in contemporary client microprocessors, further proliferated by ISA support for instruction sets such as AVX and SSE, and are also used extensively in server processors for engineering and scientific applications. Consequently, FPMAC design is a vital consideration, since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design that focuses on optimal computation in the critical path, making it the fastest FPMAC design in the literature to date. The design is based on the premise of isolating and optimizing the critical-path computation in the FPMAC operation. We present three key innovations that yield a double-precision FPMAC with the fewest gate stages to date in the timing-critical path: a) splitting near and far paths based on the exponent difference, with d = Exy - Ez in {-2, -1, 0, 1} forming the near path and all other values the far path; b) early injection of the accumulate add for the near path into the Wallace tree, which eliminates a 3:2 compressor from the near-path critical logic by exploiting the small alignment shifts in the near path and the sparse Wallace tree for 53-bit mantissa multiplication; c) a combined round and accumulate add, which eliminates the completion adder from the multiplier and gives both timing and power benefits. Because the paths are split, the design consumes less power per operation, since only the logic required for each case switches. Splitting the paths also provides significant opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates, based purely on the exponent difference signals. We also demonstrate support for all rounding modes, in adherence to the IEEE standard for double-precision FPMAC, which is critical for deploying this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation, the IBM Power6 [6], by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared with the best known timing design, from Lang et al. [5], and outperforms it by 7% while being 30% smaller in area.
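As a rough illustration of the path split in (a) and the 3:2 compressor mentioned in (b), the following Python sketch classifies an operation x*y + z by its exponent difference and shows the carry-save identity a 3:2 compressor relies on. It is a software sketch under stated assumptions, not the authors' circuit: the function names are hypothetical, and taking exponents from `math.frexp` (with Exy as the sum of the two multiplicand exponents, before normalization) is an assumed convention that the abstract does not specify. Zeros, infinities, NaNs, and subnormals are not treated specially.

```python
import math
from typing import Tuple

# Exponent differences that the abstract assigns to the near path.
NEAR_PATH_DIFFS = {-2, -1, 0, 1}


def exponent_difference(x: float, y: float, z: float) -> int:
    """Return d = Exy - Ez for the operation x*y + z.

    Assumption: exponents follow math.frexp (mantissa in [0.5, 1)),
    and Exy is the unnormalized product exponent Ex + Ey.
    """
    _, ex = math.frexp(x)
    _, ey = math.frexp(y)
    _, ez = math.frexp(z)
    return (ex + ey) - ez


def select_path(x: float, y: float, z: float) -> str:
    """Classify x*y + z as 'near' or 'far' from the exponent difference."""
    return "near" if exponent_difference(x, y, z) in NEAR_PATH_DIFFS else "far"


def compress_3_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """Carry-save 3:2 compressor on non-negative integers.

    Reduces three operands to a (sum, carry) pair without propagating
    carries, such that a + b + c == sum + carry.
    """
    partial_sum = a ^ b ^ c                       # bitwise sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, shifted left
    return partial_sum, carry


if __name__ == "__main__":
    print(select_path(1.5, 2.0, 3.0))      # product and addend of similar magnitude
    print(select_path(1.5, 2.0, 1024.0))   # addend much larger than the product
    s, c = compress_3_2(13, 7, 9)
    print(s + c == 13 + 7 + 9)
```

Under this convention, the near path covers the cases where the product and the addend are close in magnitude, so only small alignment shifts are needed. Each 3:2 compressor level adds delay to the reduction tree, which is why removing one from the near-path critical logic matters for timing.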



