Rudresh Pratap Singh, Shreyam Kumar, Jugal Gandhi, Diksha Shekhawat, M. Santosh, Jai Gopal Pandey
{"title":"一种基于时域二维OaA的卷积神经网络加速器","authors":"Rudresh Pratap Singh , Shreyam Kumar , Jugal Gandhi , Diksha Shekhawat , M. Santosh , Jai Gopal Pandey","doi":"10.1016/j.memori.2023.100041","DOIUrl":null,"url":null,"abstract":"<div><p>Convolutional neural networks (CNNs) are widely implemented in modern facial recognition systems for image recognition applications. Runtime speed is a critical parameter for real-time systems. Traditional FPGA-based accelerations require either large on-chip memory or high bandwidth and high memory access time that slow down the network. The proposed work uses an algorithm and its subsequent hardware design for a quick CNN computation using an overlap-and-add-based technique in the time domain. In the algorithm, the input images are broken into tiles that can be processed independently without computing overhead in the frequency domain. This also allows for efficient concurrency of the convolution process, resulting in higher throughput and lower power consumption. At the same time, we maintain low on-chip memory requirements necessary for faster and cheaper processor designs. We implemented CNN VGG-16 and AlexNet models with our design on Xilinx Virtex-7 and Zynq boards. The performance analysis of our design provides 48% better throughput than the state-of-the-art AlexNet and uses 68.85% lesser multipliers and other resources than the state-of-the-art VGG-16.</p></div>","PeriodicalId":100915,"journal":{"name":"Memories - Materials, Devices, Circuits and Systems","volume":"4 ","pages":"Article 100041"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A time domain 2D OaA-based convolutional neural networks accelerator\",\"authors\":\"Rudresh Pratap Singh , Shreyam Kumar , Jugal Gandhi , Diksha Shekhawat , M. Santosh , Jai Gopal Pandey\",\"doi\":\"10.1016/j.memori.2023.100041\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Convolutional neural networks (CNNs) are widely implemented in modern facial recognition systems for image recognition applications. Runtime speed is a critical parameter for real-time systems. Traditional FPGA-based accelerations require either large on-chip memory or high bandwidth and high memory access time that slow down the network. The proposed work uses an algorithm and its subsequent hardware design for a quick CNN computation using an overlap-and-add-based technique in the time domain. In the algorithm, the input images are broken into tiles that can be processed independently without computing overhead in the frequency domain. This also allows for efficient concurrency of the convolution process, resulting in higher throughput and lower power consumption. At the same time, we maintain low on-chip memory requirements necessary for faster and cheaper processor designs. We implemented CNN VGG-16 and AlexNet models with our design on Xilinx Virtex-7 and Zynq boards. 
The performance analysis of our design provides 48% better throughput than the state-of-the-art AlexNet and uses 68.85% lesser multipliers and other resources than the state-of-the-art VGG-16.</p></div>\",\"PeriodicalId\":100915,\"journal\":{\"name\":\"Memories - Materials, Devices, Circuits and Systems\",\"volume\":\"4 \",\"pages\":\"Article 100041\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Memories - Materials, Devices, Circuits and Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S277306462300018X\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Memories - Materials, Devices, Circuits and Systems","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S277306462300018X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A time domain 2D OaA-based convolutional neural networks accelerator
Convolutional neural networks (CNNs) are widely used in modern facial recognition systems and other image recognition applications. Runtime speed is a critical parameter for real-time systems. Traditional FPGA-based accelerators require either large on-chip memory or high bandwidth, and suffer long memory access times that slow down the network. The proposed work presents an algorithm and a corresponding hardware design for fast CNN computation using an overlap-and-add (OaA) based technique in the time domain. In the algorithm, the input images are broken into tiles that can be processed independently, without the computational overhead of transforming to the frequency domain. This also allows the convolution process to be parallelized efficiently, resulting in higher throughput and lower power consumption, while keeping the low on-chip memory requirements necessary for faster and cheaper processor designs. We implemented the VGG-16 and AlexNet CNN models with our design on Xilinx Virtex-7 and Zynq boards. The performance analysis of our design shows 48% higher throughput than a state-of-the-art AlexNet implementation and 68.85% fewer multipliers and other resources than a state-of-the-art VGG-16 implementation.
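The core idea behind time-domain overlap-and-add convolution can be illustrated with a minimal software sketch: the input is split into non-overlapping tiles, each tile is convolved independently (and thus can be assigned to a separate processing element), and the partial outputs, which overlap by the kernel size minus one, are summed at their offsets. The code below is an illustrative NumPy/SciPy sketch under assumed tile and kernel sizes, not the paper's hardware design.

```python
# Minimal sketch of 2D overlap-and-add (OaA) convolution in the time/spatial
# domain. The tile size and the use of scipy.signal.convolve2d for the
# per-tile convolution are illustrative assumptions.
import numpy as np
from scipy.signal import convolve2d


def oaa_conv2d(image, kernel, tile=8):
    """Full 2D convolution of `image` with `kernel` via overlap-and-add."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # A full convolution output is larger than the input by (kernel - 1).
    out = np.zeros((ih + kh - 1, iw + kw - 1),
                   dtype=np.result_type(image, kernel))

    # Break the input into non-overlapping tiles; each tile can be convolved
    # independently, e.g. by parallel processing elements in hardware.
    for r in range(0, ih, tile):
        for c in range(0, iw, tile):
            block = image[r:r + tile, c:c + tile]
            partial = convolve2d(block, kernel, mode="full")
            # Overlap-and-add: partial outputs from neighbouring tiles
            # overlap by (kernel - 1) samples and are summed at their offsets.
            out[r:r + partial.shape[0], c:c + partial.shape[1]] += partial
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    k = rng.standard_normal((3, 3))
    # By linearity of convolution, the tiled result matches a direct
    # full convolution of the whole image.
    assert np.allclose(oaa_conv2d(x, k, tile=8),
                       convolve2d(x, k, mode="full"))
```

Because each tile's convolution depends only on that tile and the kernel, the per-tile work maps naturally onto independent hardware units, which is the concurrency the abstract refers to; the hardware specifics (dataflow, multiplier counts, memory organisation) are described in the full paper.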