{"title":"Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System","authors":"Chi Zhang, V. Prasanna","doi":"10.1145/3020078.3021727","DOIUrl":null,"url":null,"abstract":"We present a novel mechanism to accelerate state-of-art Convolutional Neural Networks (CNNs) on CPU-FPGA platform with coherent shared memory. First, we exploit Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layer. We map the frequency domain algorithms onto a highly-parallel OaA-based 2D convolver design on the FPGA. Then, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To reduce the memory access latency and sustain peak performance of the FPGA, our design employs double buffering. To reduce the inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach can be applied to any kernel size less than the chosen FFT size with appropriate zero-padding leading to acceleration of a wide range of CNN models. We exploit the data parallelism of OaA-based 2D convolver and task parallelism to scale the overall system performance. By using OaA, the number of floating point operations is reduced by 39.14% ~54.10% for the state-of-art CNNs. We implement VGG16, AlexNet and GoogLeNet on Intel QuickAssist QPI FPGA Platform. These designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves 1.35x GFLOPs/sec improvement using 3.33x less multipliers and 1.1x less memory. Compared with the state-of-art VGG16 implementation, our design has 0.66x GFLOPs/sec using 3.48x less multipliers without impacting the classification accuracy. For GoogLeNet implementation, our design achieves 5.56x improvement in performance compared with 16 threads running on a 10 Core Intel Xeon Processor at 2.8 GHz.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"140","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021727","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 140
Abstract
We present a novel mechanism to accelerate state-of-the-art Convolutional Neural Networks (CNNs) on a CPU-FPGA platform with coherent shared memory. First, we exploit the Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layers. We map the frequency-domain algorithms onto a highly parallel OaA-based 2D convolver design on the FPGA. Then, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To reduce memory access latency and sustain the FPGA's peak performance, our design employs double buffering. To reduce inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach can be applied to any kernel size smaller than the chosen FFT size with appropriate zero-padding, enabling acceleration of a wide range of CNN models. We exploit the data parallelism of the OaA-based 2D convolver and task parallelism to scale overall system performance. By using OaA, the number of floating-point operations is reduced by 39.14%–54.10% for state-of-the-art CNNs. We implement VGG16, AlexNet, and GoogLeNet on the Intel QuickAssist QPI FPGA Platform. These designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec, and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves a 1.35x improvement in GFLOPs/sec using 3.33x fewer multipliers and 1.1x less memory. Compared with the state-of-the-art VGG16 implementation, our design achieves 0.66x the GFLOPs/sec using 3.48x fewer multipliers, without impacting classification accuracy. For GoogLeNet, our design achieves a 5.56x performance improvement over 16 threads running on a 10-core Intel Xeon processor at 2.8 GHz.
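To make the core technique concrete, the sketch below models Overlap-and-Add convolution in the frequency domain in NumPy. This is not the paper's FPGA design; it is a minimal software illustration under stated assumptions: a square K×K kernel, a single-channel real-valued input, and a "full" linear convolution output. The function name `oaa_conv2d` and its parameters are hypothetical. The image is tiled into blocks of size `fft_size - K + 1`, each block is convolved with the zero-padded kernel via an `fft_size`×`fft_size` FFT, and overlapping tails of adjacent tiles are summed, which is why any kernel with K ≤ `fft_size` can be handled by the same convolver after zero-padding.

```python
import numpy as np

def oaa_conv2d(image, kernel, fft_size):
    """2D linear convolution via Overlap-and-Add (OaA).

    Each (fft_size - K + 1)-sized block's linear convolution with the
    K x K kernel fits exactly in an fft_size x fft_size FFT tile, so no
    circular wraparound occurs; adjacent tiles overlap by K - 1 samples
    and are added together.
    """
    K = kernel.shape[0]                       # assumes a square kernel
    step = fft_size - K + 1                   # valid block size per FFT tile
    H, W = image.shape
    out = np.zeros((H + K - 1, W + K - 1))    # "full" convolution output
    # The kernel spectrum is computed once and reused for every tile.
    kf = np.fft.rfft2(kernel, (fft_size, fft_size))
    for i in range(0, H, step):
        for j in range(0, W, step):
            block = image[i:i + step, j:j + step]
            bf = np.fft.rfft2(block, (fft_size, fft_size))
            tile = np.fft.irfft2(bf * kf, (fft_size, fft_size))
            # Clip the tile at the output boundary, then overlap-and-add.
            h = min(fft_size, out.shape[0] - i)
            w = min(fft_size, out.shape[1] - j)
            out[i:i + h, j:j + w] += tile[:h, :w]
    return out
```

As a quick check, the result can be compared against `scipy.signal.convolve2d(image, kernel, mode="full")`. The FLOP savings the abstract reports come from replacing the O(K²) multiplications per output pixel of direct convolution with per-tile FFTs whose cost grows only as O(fft_size² · log fft_size), amortized over `step²` output pixels.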