A 64-TOPS Energy-Efficient Tensor Accelerator in 14nm With Reconfigurable Fetch Network and Processing Fusion for Maximal Data Reuse

Sang Min Lee;Hanjoon Kim;Jeseung Yeon;Juyun Lee;Younggeun Choi;Minho Kim;Changjae Park;Kiseok Jang;Youngsik Kim;Yongseung Kim;Changman Lee;Hyuck Han;Won Eung Kim;Rui Tang;Joon Ho Baek
Published in IEEE Open Journal of the Solid-State Circuits Society, vol. 2, pp. 219-230, 2022-10-25. DOI: 10.1109/OJSSCS.2022.3216798. Source: https://ieeexplore.ieee.org/document/9927346/

Abstract

For energy-efficient accelerators in data centers that leverage advances in the performance and energy efficiency of recent algorithms, flexible architectures are critical to support state-of-the-art algorithms for various deep learning tasks. Due to the matrix multiplication units at the core of tensor operations, most recent programmable architectures lack flexibility for layers with diminished dimensions, especially for inferences where a large batch axis is rarely allowed. In addition, exploiting the data reuse inherent within tensor operations for computing a single matrix multiplication is challenging. In this work, an extension of a vector processor in 14 nm is proposed, which is customized to tensor operations. The flexible architecture enables a tensorized loop to support various data layouts and different shapes and sizes of tensor operations. It also exploits all possible data reuse, including input, weight, and output. Based on the tensorized loop, fetch and reduction networks, which unicast or multicast with the ordering of both input data and processing data, can be simplified using a circuit-switching-like network with configured topology and flow control for each tensor operation. Two processing elements can be fused to optimize latency for a large model or can operate individually for throughput. As a result, various state-of-the-art models can be processed efficiently with straightforward compiler optimization, and the highest energy efficiency of 13.4 inferences/s/W on EfficientNetV2-S is demonstrated.
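
The abstract's point about input, weight, and output reuse can be made concrete with a generic tiled matrix multiplication. The sketch below is not the authors' tensorized loop or hardware. It is a minimal illustration under two assumptions: each operand has a buffer that holds exactly one tile, and a fetch is counted whenever the tile needed differs from the tile held. The function name `count_tile_fetches` and its parameters are hypothetical.

```python
from itertools import product


def count_tile_fetches(m_tiles, n_tiles, k_tiles, loop_order):
    """Count operand tile fetches for a tiled matmul C[M,N] += A[M,K] @ B[K,N].

    Illustrative model only (an assumption, not taken from the paper):
    one single-tile buffer per operand, and a fetch is counted each time
    the needed tile differs from the tile currently held.
    loop_order is a permutation of "mnk", outermost loop first.
    """
    extents = {"m": m_tiles, "n": n_tiles, "k": k_tiles}
    held = {"input": None, "weight": None, "output": None}
    fetches = {"input": 0, "weight": 0, "output": 0}

    # Walk the tile loop nest in the requested order.
    for idx in product(*(range(extents[axis]) for axis in loop_order)):
        pos = dict(zip(loop_order, idx))
        needed = {
            "input": (pos["m"], pos["k"]),   # tile of A
            "weight": (pos["k"], pos["n"]),  # tile of B
            "output": (pos["m"], pos["n"]),  # tile of C (partial sums)
        }
        for name, tile in needed.items():
            if held[name] != tile:
                held[name] = tile
                fetches[name] += 1
    return fetches


if __name__ == "__main__":
    for order in ("mnk", "mkn", "nkm"):
        print(order, count_tile_fetches(2, 3, 4, order))
```

In this model, the innermost loop axis decides which single operand stays resident: `k` innermost keeps the output tile, `n` innermost keeps the input tile, and `m` innermost keeps the weight tile. A fixed loop order therefore captures only one kind of reuse, which gives some intuition for why the abstract describes exploiting all three within one matrix multiplication as challenging.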