A 280mV-to-1.1V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22nm CMOS

S. Hsu, A. Agarwal, M. Anders, S. Mathew, Himanshu Kaul, F. Sheikh, R. Krishnamurthy
DOI: 10.1109/ISSCC.2012.6176966 (https://doi.org/10.1109/ISSCC.2012.6176966)
Published in: 2012 IEEE International Solid-State Circuits Conference
Publication date: 2012-04-03
Citations: 46

Abstract

Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
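
To make the 2-dimensional shuffle concrete, the following is a minimal functional sketch in Python, not a description of the fabricated circuits. It models the horizontal shuffle as a byte-wise any-to-any permute over a 32-byte (256b) word and the vertical shuffle as a lane-wise gather across register file entries. The function names, the reading of "4-way" as four 64b lanes, and the three-step rotate/gather/rotate decomposition of the 4×4 transpose are all illustrative assumptions; the abstract does not say how the engine actually sequences the transpose.

```python
WORD_BYTES = 32  # one 256b vector


def permute_bytes(src, sel):
    """Horizontal shuffle model: output byte i is src[sel[i]] (any-to-any)."""
    assert len(src) == WORD_BYTES and len(sel) == WORD_BYTES
    return bytes(src[s] for s in sel)


def lane_select(sel, ways):
    """Expand a per-lane select (ways entries) into a per-byte select."""
    assert len(sel) == ways and WORD_BYTES % ways == 0
    width = WORD_BYTES // ways
    return [s * width + b for s in sel for b in range(width)]


def vertical_gather(entries, rows, ways):
    """Vertical shuffle model: lane l of the result is lane l of entries[rows[l]]."""
    width = WORD_BYTES // ways
    return b"".join(
        entries[rows[l]][l * width:(l + 1) * width] for l in range(ways)
    )


def transpose4x4(regs):
    """Transpose a 4x4 matrix of 64b elements held in four 256b words.

    Hypothetical decomposition: skew rows, gather down the lanes, unskew.
    """
    n = 4
    # Step 1 (horizontal): rotate row i right by i lanes.
    a = [
        permute_bytes(regs[i], lane_select([(l - i) % n for l in range(n)], n))
        for i in range(n)
    ]
    # Step 2 (vertical): row k takes lane l from skewed row (l - k) mod 4.
    b = [
        vertical_gather(a, [(l - k) % n for l in range(n)], n)
        for k in range(n)
    ]
    # Step 3 (horizontal): rotate row r left by r lanes.
    return [
        permute_bytes(b[r], lane_select([(c + r) % n for c in range(n)], n))
        for r in range(n)
    ]


def make_matrix():
    """Test data: element (i, j) is eight copies of the byte value 4*i + j."""
    return [
        b"".join(bytes([4 * i + j]) * 8 for j in range(4)) for i in range(4)
    ]


if __name__ == "__main__":
    for row in transpose4x4(make_matrix()):
        # Print the first byte of each 64b lane.
        print([row[8 * c] for c in range(4)])
```

In this model a horizontal step alone cannot move data between entries and a vertical step alone cannot move data between lanes, which is why the sketch needs both dimensions to complete the transpose.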