ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors

IF 1.5 · CAS Tier 3 (Computer Science) · Q4 Computer Science, Hardware & Architecture · ACM Transactions on Architecture and Code Optimization · Pub Date: 2024-03-21 · DOI: 10.1145/3653363
Ching-Jui Lee, Tsung Tai Yeh
{"title":"ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors","authors":"Ching-Jui Lee, Tsung Tai Yeh","doi":"10.1145/3653363","DOIUrl":null,"url":null,"abstract":"<p>Systolic array architecture has significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that can perform multiply-accumulate (MAC). Traditionally, the systolic array can execute a certain amount of tensor data that matches the size of the systolic array simultaneously at each cycle. However, hyper-parameters of DNN models differ across each layer and result in various tensor sizes in each layer. Mapping these irregular tensors to the systolic array while fully utilizing the entire PEs in a systolic array is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow. However, such a dataflow isn’t optimal for every DNN model. </p><p>This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors on the spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the execution of the input, weight, and output-stationary dataflow on PEs. ReSA also decomposes the coarse-grain systolic array into multiple small ones to reduce the fragmentation issue on the tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to each other through the simple interconnected network. Furthermore, ReSA reorders the memory access to overlap the memory load and execution stages to hide the memory latency when tackling tiny tensors. Finally, ReSA splits tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. Then, ReSA encodes the predefined dataflow in our proposed instruction to notify the systolic array to switch the dataflow correctly. As a result, our optimization on the systolic array architecture achieves a geometric mean speedup of 1.87X over the weight-stationary systolic array architecture across 9 different DNN models.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"19 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3653363","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs), each performing multiply-accumulate (MAC) operations. Traditionally, a systolic array processes, in each cycle, an amount of tensor data that matches the array's dimensions. However, the hyper-parameters of DNN models differ from layer to layer, producing tensors of varying sizes in each layer. Mapping these irregular tensors onto the systolic array while keeping all of its PEs busy is challenging. Furthermore, modern systolic DNN accelerators typically employ a single dataflow, and no single dataflow is optimal for every DNN model.
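To make the utilization problem concrete, the sketch below estimates what fraction of PEs do useful work when a single small tensor tile is mapped onto a fixed-size array. The function name, tile sizes, and array dimensions are illustrative assumptions, not part of the ReSA design:

```python
# Hypothetical illustration of the fragmentation problem the abstract
# describes: a tiny tensor tile mapped onto a fixed-size systolic array
# leaves most PEs idle. All names and sizes here are illustrative.

def pe_utilization(tile_rows: int, tile_cols: int,
                   array_rows: int = 128, array_cols: int = 128) -> float:
    """Fraction of PEs performing useful MACs when one tile occupies the array."""
    used = min(tile_rows, array_rows) * min(tile_cols, array_cols)
    return used / (array_rows * array_cols)

# A 16x16 tile on a 128x128 array keeps only 1.56% of the PEs busy --
# exactly the kind of waste that spatial partitioning aims to recover.
print(f"{pe_utilization(16, 16):.2%}")  # -> 1.56%
```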

This work proposes ReSA, a reconfigurable dataflow architecture that minimizes the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, ReSA's data-path controller can execute input-stationary, weight-stationary, and output-stationary dataflows on its PEs. ReSA also decomposes the coarse-grained systolic array into multiple small sub-arrays to reduce fragmentation in the tensor mapping. Each systolic sub-array relies on a data arbiter to dispatch tensors to the other sub-arrays through a simple interconnection network. Furthermore, ReSA reorders memory accesses so that the memory-load and execution stages overlap, hiding memory latency when handling tiny tensors. Finally, ReSA splits the tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side; it then encodes the chosen dataflow in a proposed instruction that tells the systolic array when to switch dataflows. With these optimizations, ReSA achieves a geometric-mean speedup of 1.87x over a weight-stationary systolic array across nine DNN models.
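The abstract states that ReSA searches on the host for the best dataflow per tensor tile and encodes the choice in an instruction. A minimal sketch of what such a search loop could look like follows; the cost model and the structure of the tile are invented for illustration and are not the paper's actual formulation:

```python
# Hypothetical sketch of a host-side per-tile dataflow search in the
# spirit of ReSA. The cost model below is a toy placeholder; the paper
# defines its own model and instruction encoding.

from dataclasses import dataclass

DATAFLOWS = ("input_stationary", "weight_stationary", "output_stationary")

@dataclass
class Tile:
    rows: int   # output rows of this sub-tensor
    cols: int   # output columns
    depth: int  # reduction (accumulation) length

def estimated_cycles(tile: Tile, dataflow: str, sub_array: int = 32) -> int:
    """Toy cost model: hold the stationary operand, stream the others."""
    # A real model would account for operand reuse, fill/drain latency,
    # and memory bandwidth; this only captures the streamed volume.
    streamed = {
        "input_stationary":  tile.cols * tile.depth,
        "weight_stationary": tile.rows * tile.depth,
        "output_stationary": tile.rows * tile.cols,
    }[dataflow]
    fill_drain = 2 * sub_array
    return streamed + fill_drain

def pick_dataflow(tile: Tile) -> str:
    """Choose the dataflow with the lowest estimated execution time."""
    return min(DATAFLOWS, key=lambda d: estimated_cycles(tile, d))

tile = Tile(rows=16, cols=64, depth=256)
best = pick_dataflow(tile)
# The chosen dataflow would then be encoded into the instruction stream
# so the array's data-path controller reconfigures before the tile runs.
print(best, estimated_cycles(tile, best))
```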

Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 3.60
Self-citation rate: 6.20%
Articles per year: 78
Review time: 6-12 weeks

Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.
Latest articles from this journal:

- A Survey of General-purpose Polyhedral Compilers
- Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
- Scythe: A Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory
- FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration
- CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling