An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform
Deguang Wang, Junzhong Shen, M. Wen, Chunyuan Zhang
Proceedings of the 48th International Conference on Parallel Processing, 2019-08-05
DOI: 10.1145/3337821.3337846
Citations: 2
Abstract
Convolutional Neural Networks (CNNs) have achieved impressive performance on various computer vision tasks. To achieve better performance, complicated-connected CNN models (e.g., GoogLeNet and DenseNet) have recently been proposed and have achieved state-of-the-art results in image classification and segmentation. However, CNNs are computation- and memory-intensive, so developing hardware accelerators to speed up CNN inference and training is of great importance. Owing to the high performance, reconfigurability and energy efficiency of Field-Programmable Gate Arrays (FPGAs), many FPGA-based accelerators have been proposed for CNNs and have achieved high throughput and energy efficiency. However, the large number of parameters in complicated-connected CNN models exceeds the limited hardware resources of a single FPGA board, which cannot meet the memory and computation demands of mapping an entire CNN model. Accordingly, in this paper, we propose a complete design flow for accelerating the inference of complicated-connected CNNs on a multi-FPGA platform, comprising DAG abstraction, mapping scheme generation and design space exploration. In addition, a multi-FPGA system with flexible inter-FPGA communication is proposed to efficiently support this design flow. Experimental results on representative models show that the proposed multi-FPGA system design achieves a throughput acceleration of up to 145.2× and 2.5× over CPU and GPU solutions, respectively, as well as an energy efficiency improvement of up to 139.1× and 4.8× over multi-core CPU and GPU solutions.
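To make the DAG-abstraction and mapping steps of such a flow concrete, the sketch below models a small branching network (GoogLeNet-style) as a layer DAG, topologically sorts it, and greedily packs consecutive layers onto FPGA boards under a resource budget. This is a minimal illustrative sketch, not the paper's actual algorithm: the layer names, per-layer resource costs, and the greedy packing heuristic are all hypothetical placeholders, and the paper's design space exploration is far more involved.

```python
from collections import defaultdict, deque

# Hypothetical per-layer resource costs (arbitrary units); not from the paper.
LAYERS = {"conv1": 3, "conv2a": 2, "conv2b": 2, "concat": 1, "fc": 4}

# Edges capture the complicated (branching) connectivity of the model.
EDGES = [("conv1", "conv2a"), ("conv1", "conv2b"),
         ("conv2a", "concat"), ("conv2b", "concat"), ("concat", "fc")]

def topo_order(layers, edges):
    """Kahn's algorithm: topologically sort the layer DAG."""
    indeg = {v: 0 for v in layers}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def partition(order, costs, budget):
    """Greedily pack consecutive layers onto boards under a resource budget."""
    boards, current, used = [], [], 0
    for layer in order:
        if current and used + costs[layer] > budget:
            boards.append(current)   # current board is full; start a new one
            current, used = [], 0
        current.append(layer)
        used += costs[layer]
    if current:
        boards.append(current)
    return boards

mapping = partition(topo_order(LAYERS, EDGES), LAYERS, budget=5)
print(mapping)  # → [['conv1', 'conv2a'], ['conv2b', 'concat'], ['fc']]
```

Each inner list stands for the layers assigned to one FPGA board; inter-board edges would then correspond to the inter-FPGA communication links the system must provide.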