BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays

IF 3.8; Zone 2, Computer Science; JCR Q2, Computer Science, Hardware & Architecture; IEEE Transactions on Computers; Pub Date: 2024-08-23; DOI: 10.1109/TC.2024.3449103

Mingeon Park;Seokjin Hwang;Hyungmin Cho

IEEE Transactions on Computers, vol. 73, no. 12, pp. 2708-2721. Published 2024-08-23. https://ieeexplore.ieee.org/document/10644120/

Citations: 0

Abstract

Depthwise convolution (DWConv) is an effective technique for reducing the size and computational requirements of convolutional neural networks. However, DWConv's input reuse pattern is not easily transformed into dense matrix multiplications, leading to low utilization of processing elements (PEs) on existing systolic arrays. In this paper, we introduce a novel systolic array dataflow mechanism called BiRD, designed to maximize input reuse and boost DWConv performance. BiRD utilizes two directions of input reuse and necessitates only minor modifications to a typical weight-stationary type systolic array. We evaluate BiRD on the Gemmini platform, comparing it with existing dataflow types. The results demonstrate that BiRD achieves significant performance improvements in computation time reduction, while incurring minimal area overhead and improved energy consumption compared to other dataflow types. For example, on a $32 \times 32$ systolic array, it results in a 9.8% area overhead, significantly smaller than other dataflow types for DWConv. Compared to matrix multiplication-based DWConv, BiRD achieves a $4.7\times$ performance improvement for DWConv layers of MobileNet-V2, resulting in a 55.8% reduction in total inference computation time and a 44.9% reduction in energy consumption. Our results highlight the effectiveness of BiRD in enhancing the performance of DWConv on systolic arrays.
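To make concrete why DWConv does not transform easily into a dense matrix multiplication, the following minimal NumPy sketch lowers a depthwise convolution to a single matmul and measures how sparse the resulting weight matrix is. It is an illustration only: it is not the BiRD dataflow, and it is not claimed to be how Gemmini or the paper's baseline maps DWConv. The function names (`depthwise_conv2d`, `im2col`, `depthwise_as_matmul`), the stride-1, no-padding setting, and the tensor sizes are all assumptions chosen for this example.

```python
import numpy as np


def depthwise_conv2d(x, w):
    """Direct depthwise convolution (stride 1, no padding).

    x: (C, H, W) input; w: (C, K, K), one filter per channel.
    """
    C, H, W = x.shape
    K = w.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                # Each channel only ever sees its own filter.
                out[c, i, j] = np.sum(x[c, i:i + K, j:j + K] * w[c])
    return out


def im2col(x, K):
    """Unfold x (C, H, W) into a (Ho*Wo, C*K*K) matrix of patches."""
    C, H, W = x.shape
    Ho, Wo = H - K + 1, W - K + 1
    cols = np.zeros((Ho * Wo, C * K * K))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[:, i:i + K, j:j + K].reshape(-1)
    return cols


def depthwise_as_matmul(x, w):
    """Same result as depthwise_conv2d, computed as one matrix product.

    Returns the output and the fraction of nonzero weight-matrix entries.
    """
    C, H, W = x.shape
    K = w.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    # Block-diagonal weights: column c holds only channel c's K*K taps.
    wmat = np.zeros((C * K * K, C))
    for c in range(C):
        wmat[c * K * K:(c + 1) * K * K, c] = w[c].reshape(-1)
    out = im2col(x, K) @ wmat  # shape (Ho*Wo, C)
    density = np.count_nonzero(wmat) / wmat.size
    return out.T.reshape(C, Ho, Wo), density


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))  # hypothetical sizes
w = rng.standard_normal((8, 3, 3))
ref = depthwise_conv2d(x, w)
out, density = depthwise_as_matmul(x, w)
print("results match:", np.allclose(ref, out))
print("weight-matrix density:", density)
```

With nonzero filters, the weight matrix in this lowering has density 1/C, so most multiply-accumulates in the equivalent dense matmul operate on zeros. This is one simple way to see the low PE utilization the abstract describes; the paper's actual baseline mapping may differ.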




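Source journal

IEEE Transactions on Computers (Engineering & Technology, Engineering: Electrical & Electronic)

CiteScore: 6.60

Self-citation rate: 5.40%

Articles published: 199

Review time: 6.0 months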

About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
