Systolic Array Based Convolutional Neural Network Inference on FPGA

2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Pub Date : 2022-12-01 DOI:10.1109/MCSoC57363.2022.00029

Shi Hui Chua, T. Teo, Mulat Ayinet Tiruye, I-Chyn Wey

{"title":"Systolic Array Based Convolutional Neural Network Inference on FPGA","authors":"Shi Hui Chua, T. Teo, Mulat Ayinet Tiruye, I-Chyn Wey","doi":"10.1109/MCSoC57363.2022.00029","DOIUrl":null,"url":null,"abstract":"Convolutional Neural Networks (CNNs) possess a particular edge over its predecessor, the Multi-Layer Perceptron (MLP). This is due to its weight sharing features that allows the CNN to use less parameters for the same number of outputs as compared to the MLP. Systolic arrays capitalize on the weight sharing property of CNNs to do data reuse while performing convolutional operations, in order to reduce the power consumption from the memory accesses. A kernel fitting systolic processing element array was designed with only positive multiplication to increase the throughput and power efficiency of the CNN accelerator, while using weight stationary dataflow to achieve data reuse in the systolic array. A cost-optimized lightweight solution is implemented through low-cost FPGA hardware so as to allow for greater accessibility. The CNN accelerator consumes 0.363 W power at 100 MHz operating frequency. A peak throughput of 10.98 GOps/s was achieved with peak performance density of 0.200 GOps/s/DSP and peak power efficiency of 30.26 GOps/s/W. Even with the added support for additional functions, proposed design achieved up to 1.59x better power efficiency compared to other systolic implementations and up to 6.17x better power efficiency compared to non-systolic implementations.","PeriodicalId":150801,"journal":{"name":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MCSoC57363.2022.00029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Convolutional Neural Networks (CNNs) possess a particular edge over its predecessor, the Multi-Layer Perceptron (MLP). This is due to its weight sharing features that allows the CNN to use less parameters for the same number of outputs as compared to the MLP. Systolic arrays capitalize on the weight sharing property of CNNs to do data reuse while performing convolutional operations, in order to reduce the power consumption from the memory accesses. A kernel fitting systolic processing element array was designed with only positive multiplication to increase the throughput and power efficiency of the CNN accelerator, while using weight stationary dataflow to achieve data reuse in the systolic array. A cost-optimized lightweight solution is implemented through low-cost FPGA hardware so as to allow for greater accessibility. The CNN accelerator consumes 0.363 W power at 100 MHz operating frequency. A peak throughput of 10.98 GOps/s was achieved with peak performance density of 0.200 GOps/s/DSP and peak power efficiency of 30.26 GOps/s/W. Even with the added support for additional functions, proposed design achieved up to 1.59x better power efficiency compared to other systolic implementations and up to 6.17x better power efficiency compared to non-systolic implementations.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于收缩阵列的卷积神经网络推理FPGA

卷积神经网络(cnn)比其前身多层感知器(MLP)具有特殊的优势。这是由于与MLP相比，它的权重共享特性允许CNN使用更少的参数来获得相同数量的输出。收缩数组利用cnn的权值共享特性，在执行卷积运算的同时进行数据重用，以减少内存访问的功耗。为了提高CNN加速器的吞吐量和功率效率，设计了一种仅采用正乘法的核拟合收缩处理单元阵列，同时使用权值固定的数据流实现收缩阵列中的数据重用。通过低成本的FPGA硬件实现了成本优化的轻量级解决方案，从而允许更大的可访问性。工作频率为100mhz时，CNN加速器功耗为0.363 W。峰值吞吐量为10.98 GOps/s，峰值性能密度为0.200 GOps/s/DSP，峰值功率效率为30.26 GOps/s/W。即使增加了对其他功能的支持，与其他收缩实现相比，所提出的设计的功率效率提高了1.59倍，与非收缩实现相比，功率效率提高了6.17倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

自引率

0.00%

发文量