An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform
Deguang Wang, Junzhong Shen, M. Wen, Chunyuan Zhang
Proceedings of the 48th International Conference on Parallel Processing, 2019-08-05
DOI: 10.1145/3337821.3337846
Citations: 2
Abstract
Convolutional Neural Networks (CNNs) have achieved impressive performance on various computer vision tasks. To achieve better performance, complicated-connected CNN models (e.g., GoogLeNet and DenseNet) have recently been proposed and have achieved state-of-the-art results in image classification and segmentation. However, CNNs are computation- and memory-intensive, so developing hardware accelerators to speed up CNN inference and training is of great importance. Owing to the high performance, reconfigurability and energy efficiency of Field-Programmable Gate Arrays (FPGAs), many FPGA-based accelerators have been proposed for CNNs and have achieved high throughput and energy efficiency. However, the large number of parameters in complicated-connected CNN models exceeds the limited hardware resources of a single FPGA board, which cannot meet the memory and computation demands of mapping an entire CNN model. Accordingly, in this paper, we propose a complete design flow for accelerating the inference of complicated-connected CNNs on a multi-FPGA platform, comprising DAG abstraction, mapping scheme generation and design space exploration. In addition, a multi-FPGA system with flexible inter-FPGA communication is proposed to efficiently support this design flow. Experimental results on representative models show that the proposed multi-FPGA system design achieves a throughput acceleration of up to 145.2× and 2.5× over CPU and GPU solutions, respectively, as well as an energy efficiency improvement of up to 139.1× and 4.8× over multi-core CPU and GPU solutions.
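To make the DAG-abstraction and mapping steps of such a flow concrete, the sketch below models a small branching network (GoogLeNet-style) as a layer DAG, topologically sorts it, and greedily packs consecutive layers onto FPGA boards under a resource budget. This is a minimal illustrative sketch, not the paper's actual algorithm: the layer names, per-layer resource costs, and the greedy packing heuristic are all hypothetical placeholders, and the paper's design space exploration is far more involved.

```python
from collections import defaultdict, deque

# Hypothetical per-layer resource costs (arbitrary units); not from the paper.
LAYERS = {"conv1": 3, "conv2a": 2, "conv2b": 2, "concat": 1, "fc": 4}

# Edges capture the complicated (branching) connectivity of the model.
EDGES = [("conv1", "conv2a"), ("conv1", "conv2b"),
         ("conv2a", "concat"), ("conv2b", "concat"), ("concat", "fc")]

def topo_order(layers, edges):
    """Kahn's algorithm: topologically sort the layer DAG."""
    indeg = {v: 0 for v in layers}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def partition(order, costs, budget):
    """Greedily pack consecutive layers onto boards under a resource budget."""
    boards, current, used = [], [], 0
    for layer in order:
        if current and used + costs[layer] > budget:
            boards.append(current)   # current board is full; start a new one
            current, used = [], 0
        current.append(layer)
        used += costs[layer]
    if current:
        boards.append(current)
    return boards

mapping = partition(topo_order(LAYERS, EDGES), LAYERS, budget=5)
print(mapping)  # → [['conv1', 'conv2a'], ['conv2b', 'concat'], ['fc']]
```

Each inner list stands for the layers assigned to one FPGA board; inter-board edges would then correspond to the inter-FPGA communication links the system must provide.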