Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Chi Zhang, V. Prasanna. DOI: https://doi.org/10.1145/3020078.3021727

We present a novel mechanism to accelerate state-of-the-art Convolutional Neural Networks (CNNs) on a CPU-FPGA platform with coherent shared memory. First, we exploit the Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layers. We map the frequency-domain algorithms onto a highly parallel OaA-based 2D convolver design on the FPGA. Then, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To reduce memory access latency and sustain peak performance of the FPGA, our design employs double buffering. To reduce inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach can be applied to any kernel size smaller than the chosen FFT size with appropriate zero-padding, enabling acceleration of a wide range of CNN models. We exploit the data parallelism of the OaA-based 2D convolver and task parallelism to scale the overall system performance. By using OaA, the number of floating-point operations is reduced by 39.14%-54.10% for state-of-the-art CNNs. We implement VGG16, AlexNet and GoogLeNet on the Intel QuickAssist QPI FPGA platform. These designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves a 1.35x improvement in GFLOPs/sec while using 3.33x fewer multipliers and 1.1x less memory. Compared with the state-of-the-art VGG16 implementation, our design achieves 0.66x the GFLOPs/sec while using 3.48x fewer multipliers, without impacting classification accuracy. For the GoogLeNet implementation, our design achieves a 5.56x performance improvement over 16 threads running on a 10-core Intel Xeon processor at 2.8 GHz.

Thermal Flattening in 3D FPGAs Using Embedded Cooling (Abstract Only)
Girish Deshpande, D. Bhatia. DOI: https://doi.org/10.1145/3020078.3021764

Thermal management is one of the key concerns in modern high-power-density chips. A variety of thermal cooling techniques long used in industrial applications are now also being applied to integrated circuits. In this work, we explore the integration of thermal-aware CAD techniques with embedded cooling solutions to achieve smoother thermal profiles in 3D FPGAs. We also present results on coolant temperatures and flow rates and their effect on thermal gradients on the chip.

FPGA Acceleration for Computational Glass-Free Displays
Zhuolun He, Guojie Luo. DOI: https://doi.org/10.1145/3020078.3021728

Increasing computational power enables new applications whose runtimes were previously prohibitive. FPGAs provide such computational power with both reconfigurability and energy efficiency. In this paper, we demonstrate the feasibility of eyeglasses-free displays through FPGA acceleration. Specifically, we propose several techniques to accelerate sparse matrix-vector multiplication and the L-BFGS iterative optimization algorithm, taking the characteristics of FPGAs into account. Experimental results show a 12.78x overall speedup for the glasses-free display application.

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center
Naif Tarafdar, Thomas Lin, E. Fukuda, H. Bannazadeh, A. Leon-Garcia, P. Chow. DOI: https://doi.org/10.1145/3020078.3021742

We present a framework for creating network FPGA clusters in a heterogeneous cloud data center. The FPGA clusters are created using a logical kernel description, which specifies how a group of FPGA kernels are to be connected (independent of which FPGA the kernels are on), and an FPGA mapping file. The kernels within a cluster can be replicated with simple directives within this framework. The FPGAs can communicate with any other network device in the data center, including CPUs, GPUs, and IoT devices (such as sensors). The heterogeneous cloud manages these devices using OpenStack. Our current deployment is limited by the physical infrastructure, such as the 1 Gb Ethernet connection, but the framework can be ported to other physical infrastructures. We tested the infrastructure with a database acceleration application: the application was replicated six times across three FPGAs within our cluster, and throughput increased six-fold, scaling linearly. Our framework generates the OpenStack calls needed to reserve the compute devices, creates the network connections (retrieving MAC addresses), generates the bitstreams, programs the devices, and configures them with the appropriate MAC addresses, yielding a ready-to-use network device that can interact with any other network device in the data center.

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture
Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, Huazhong Yang. DOI: https://doi.org/10.1145/3020078.3021739

The performance of large-scale graph processing suffers from poor locality, lack of scalability, random access patterns, and heavy data conflicts. Several characteristics of FPGAs make them a promising platform for accelerating such applications; for example, on-chip block RAMs provide high throughput for random data access. However, large-scale processing on a single FPGA chip is constrained by limited on-chip memory resources and off-chip bandwidth. A multi-FPGA architecture can alleviate these problems to some extent, but the data partitioning and communication schemes must be designed to preserve locality and reduce data conflicts. In this paper, we propose ForeGraph, a large-scale graph processing framework based on a multi-FPGA architecture. In ForeGraph, each FPGA board stores only a partition of the entire graph in off-chip memory, reducing communication across partitions. Vertices and edges are sequentially loaded onto the FPGA chip and processed. Under our scheduling scheme, each FPGA chip performs graph processing in parallel without conflicts. We also analyze the impact of system parameters on the performance of ForeGraph. Experimental results on the Xilinx Virtex UltraScale XCVU190 chip show that ForeGraph outperforms state-of-the-art FPGA-based large-scale graph processing systems by 4.54x when executing PageRank on the Twitter graph (1.4 billion edges). The average throughput of our design exceeds 900 MTEPS, 2.03x higher than previous work.

Dynamic Partitioning for Library based Placement on Heterogeneous FPGAs (Abstract Only)
Fubing Mao, Wei Zhang, Bingsheng He, Siew-Kei Lam. DOI: https://doi.org/10.1145/3020078.3021803

Library-based design and IP reuse have previously been proposed to speed up the synthesis of large-scale FPGA designs. However, existing methods incur large area wastage due to differences in module sizes and wasted area inside each module. In this paper, we propose an efficient, dynamic module partitioning approach for the library-based design flow that minimizes area wastage. Our approach exploits pre-placement module information, such as the relative positions of blocks (CLBs, DSPs and RAMs) and the module dimensions (width, height) used for placing these blocks. We introduce a B*-tree representation to enable fast modular placement. A simulated annealing algorithm directs each round of placement and searches for the optimum. We develop a set of rules to guide module selection and partitioning during placement, eliminating wasted area within and between modules and achieving a more compact final placement. In addition, the proposed approach adapts to different architectures and handles the fixed-outline constraint. Experimental results show that our approach reduces FPGA area utilization by up to 19% compared with the state-of-the-art approach, with acceptable runtime. A more detailed description of this poster can be found in our technical report [1].

OLAF'17: Third International Workshop on Overlay Architectures for FPGAs
Hayden Kwok-Hay So, J. Wawrzynek. DOI: https://doi.org/10.1145/3020078.3030012

The Third International Workshop on Overlay Architectures for FPGAs (OLAF) is held in Monterey, California, USA, on February 22, 2017, co-located with FPGA 2017: The 25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. The main objective of the workshop is to examine how overlay architectures can help address the challenges and opportunities presented by FPGA-based reconfigurable computing. The workshop provides a venue for researchers to present and discuss the latest developments in FPGA overlay architectures and related areas. We have assembled a program of six refereed papers along with panel discussions with prominent experts in the field.

Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
E. Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan J. M. Moss, S. Subhaschandra, Guy Boudoukh. DOI: https://doi.org/10.1145/3020078.3021740

Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not match the performance of today's GPUs on DNNs. In this paper, we examine upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel 14-nm Stratix 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks), as well as high-bandwidth memories (HBMs) and improved frequency (the HyperFlex core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are evolving quickly. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bit) yield major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which is difficult for GPUs to handle but a great fit for the FPGA's extreme customizability. This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria 10, Stratix 10) against the latest, highest-performance Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM with 2-bit weights (i.e., weights constrained to {0, +1, -1}) and full-precision neurons. The Ternary ResNet accuracy is within ~1% of the full-precision ResNet that won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA delivers 60% better performance than the Titan X Pascal GPU while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks
Yufei Ma, Yu Cao, S. Vrudhula, Jae-sun Seo. DOI: https://doi.org/10.1145/3020078.3021736

As convolution layers contribute most of the operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply-and-accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory accesses) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design-variable configurations, and propose a specific dataflow for hardware CNN acceleration that minimizes memory accesses and data movement while maximizing resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, a >3.2x enhancement over state-of-the-art FPGA implementations of the VGG model.
{"title":"Session details: High-Level Synthesis -- Tools and Applications","authors":"S. Neuendorffer","doi":"10.1145/3257189","DOIUrl":"https://doi.org/10.1145/3257189","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134153947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}