Latency Sensitive FMA Design

2011 IEEE 20th Symposium on Computer Arithmetic. Pub Date: 2011-07-25. DOI: 10.1109/ARITH.2011.26

Sameh Galal, M. Horowitz

Citations: 12

Abstract

The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency-sensitive applications, our cascade design reduces the accumulation-dependent latency by 2x over a fused design, at the cost of a 13% increase in non-accumulation-dependent latency. A simple in-order execution model shows that this design is superior in most applications, providing a 12% average reduction in FP stalls and improving performance by up to 6%. Simulations of superscalar out-of-order machines show a 4% average improvement in CPI on 2-way machines and 4.6% on 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiply-add (FMA) unit.
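The latency trade-off stated in the abstract can be illustrated with a toy model. The sketch below is not the authors' execution model: it simply sums latencies along a chain of back-to-back dependent FMAs, with every value normalized to the fused FMA latency. The only numbers taken from the abstract are the 2x and 13% ratios; the names `FUSED`, `CASCADE`, `chain_latency`, and `break_even_fraction`, and the two example chains, are assumptions made for illustration.

```python
# Toy model: latencies are normalized so that a fused FMA takes 1.0.
# "acc" = the result feeds the addend of the next FMA (s = a*b + s).
# "mul" = the result feeds a multiplier input of the next FMA (x = x*y + c).
FUSED = {"acc": 1.0, "mul": 1.0}
CASCADE = {"acc": 0.5, "mul": 1.13}  # 2x shorter and 13% longer, per the abstract


def chain_latency(deps, design):
    """Sum the latencies along a chain of back-to-back dependent FMAs."""
    return sum(design[d] for d in deps)


def break_even_fraction(design, baseline=FUSED):
    """Fraction of 'acc' dependencies at which both designs have equal latency."""
    extra = design["mul"] - baseline["mul"]   # penalty per 'mul' dependency
    saved = baseline["acc"] - design["acc"]   # gain per 'acc' dependency
    return extra / (extra + saved)


# Hypothetical dependency chains of eight FMAs each.
dot_product = ["acc"] * 8   # running sum: every FMA waits on the accumulator
horner = ["mul"] * 8        # Horner's rule: every FMA waits on a multiplicand

if __name__ == "__main__":
    for name, chain in [("dot product", dot_product), ("Horner", horner)]:
        fused = chain_latency(chain, FUSED)
        cascade = chain_latency(chain, CASCADE)
        print(f"{name}: fused={fused:.2f} cascade={cascade:.2f}")
    print(f"break-even 'acc' fraction: {break_even_fraction(CASCADE):.3f}")
```

Under these simplifying assumptions (a purely serial chain, no overlap with independent work), the cascade design comes out ahead once roughly a fifth of the dependent FMAs forward their result into the accumulator input. The stall and CPI figures in the abstract come from the authors' own simulations and are not reproduced by this sketch.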


