
Latest publications: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

A Low-Power Deconvolutional Accelerator for Convolutional Neural Network Based Segmentation on FPGA: Abstract Only
Shuanglong Liu, Xinyu Niu, W. Luk
Algorithms based on Convolutional Neural Networks (CNNs) have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used as key components of state-of-the-art CNNs, enabling end-to-end training and models that support tasks such as image segmentation. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop a fully customized deconvolution architecture for CNN-based segmentation algorithms. In addition, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator, together with other optimization techniques. Furthermore, a hardware mapping framework is developed to automatically generate a high-throughput hardware design for any given CNN model on the target device. Finally, we implement our designs on a Xilinx Zynq-7030; the deconvolution accelerator achieves a performance of 25.6 GOPS at a 200 MHz working frequency and a performance density of 0.064 GOPS/DSP using 32-bit quantization, which significantly outperforms previous FPGA designs. A real-time scene segmentation application on the Cityscapes dataset is used to evaluate our CNN accelerator on the Zynq-7030 board; the system achieves 57.2 GOPS and 0.143 GOPS/DSP using 16-bit quantization, and supports up to 2 frames per second for 512x512 image inputs with a power consumption of only 3.2 W.
DOI: https://doi.org/10.1145/3174243.3174991
Citations: 3
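The transposed-convolution (deconvolution) operation at the heart of such an accelerator can be illustrated with a minimal 1-D sketch in plain Python. This is an illustrative model, not the paper's hardware architecture: each input element scatters a scaled copy of the kernel into a stride-spaced output.

```python
def deconv1d(x, w, stride=2):
    """1-D transposed convolution: each input element x[i] scatters a
    scaled copy of the kernel w into the output starting at i*stride."""
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj
    return out

# Overlapping contributions are accumulated, e.g.:
# deconv1d([1, 2], [1, 1, 1], stride=2) -> [1.0, 1.0, 3.0, 2.0, 2.0]
```

The accumulation of overlapping kernel copies is what makes deconvolution memory-access heavy, which motivates the custom dataflow in the paper.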
An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism: (Abstract Only)
Yuze Chi, Peipei Zhou, J. Cong
Stencil computation is one of the most important kernels in many applications, such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high-throughput stencil kernel is not trivial because of its high memory access load and low operational intensity. In this work, we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data; they also minimize the size of the data reuse buffer for a given number of fine-grained, fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design provides an average 6.5x speedup over a line-buffer-only design with only 2.4x resource overhead. Compared with a loop-transformation-only design, our design can implement a fully pipelined accelerator for applications that loop transformation alone cannot handle, due to its high memory conflict rate and low design flexibility. Furthermore, our FPGA implementation provides 83% of the throughput of a 14-core CPU at 4x the energy efficiency.
DOI: https://doi.org/10.1145/3174243.3174964
Citations: 2
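The line-buffer idea behind the data reuse scheme can be sketched in software: stream the image row by row, keep only the last three rows buffered, and compute each 3x3 window from the buffers, so every pixel is fetched from external memory exactly once. This is an illustrative Python model, not the authors' HLS template.

```python
from collections import deque

def stencil3x3(image, kernel):
    """Apply a 3x3 stencil by streaming rows through a 3-deep line
    buffer; each input pixel is read from 'memory' exactly once."""
    h, w = len(image), len(image[0])
    rows = deque(maxlen=3)          # line buffers: the last 3 rows seen
    out = []
    for r in range(h):
        rows.append(image[r])       # one streaming read per row
        if len(rows) == 3:          # a full window height is available
            out.append([
                sum(kernel[i][j] * rows[i][c + j]
                    for i in range(3) for j in range(3))
                for c in range(w - 2)
            ])
    return out
```

In hardware, the `rows` buffer maps to on-chip BRAM and the inner sum maps to the paper's fine-grained, fully pipelined PEs.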
A Framework for Generating High Throughput CNN Implementations on FPGAs
Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna
We propose a framework to generate highly efficient accelerators for CNN inference on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the "wasted" computation on the padded pixels. CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size, and external memory bandwidth in a device coefficient, and we identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high-throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughputs of 780.6 GOPS, 669.1 GOPS, and 552.1 GOPS for AlexNet, VGG16, and FCN-16s respectively, corresponding to 6.8x (AlexNet) and 4.9x (VGG16) improvements over the state-of-the-art implementations.
DOI: https://doi.org/10.1145/3174243.3174265
Citations: 76
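The Overlap-and-Add decomposition that CaP improves on can be demonstrated with a small Python model. Direct convolution stands in for the FFT-based block convolution used in the hardware; the point is that convolving fixed-size blocks and adding the overlapping tails reproduces the full convolution, so a single fixed FFT size can serve inputs of any length.

```python
def conv(x, h):
    """Direct linear convolution of sequence x with kernel h."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add(x, h, block=4):
    """Overlap-and-Add: convolve fixed-size blocks independently
    (in hardware, each via a fixed-size FFT) and add the
    overlapping tails back into the output."""
    y = [0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        for k, v in enumerate(conv(x[start:start + block], h)):
            y[start + k] += v
    return y
```

The "wasted" computation the paper targets is the per-block tail of length `len(h) - 1`, which grows relative to useful work as the block size shrinks.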
Routing Magic: Performing Computations Using Routing Networks and Voting Logic on Unary Encoded Data
S. Mohajer, Zhiheng Wang, K. Bazargan
The binary number representation has dominated digital logic for decades due to its compact storage requirements. However, since the number system is positional, it needs to "unpack" bits, perform computations, and repack the bits back to binary (e.g., partial products in multiplication). An alternative representation is the unary number system: we use N bits, of which the first M are 1 and the rest are 0, to represent the value M/N. We present a novel method which first converts binary numbers to unary using thermometer encoders, then uses a "scaling network" followed by voting gates that we call "alternator logic", followed by an adder tree to convert the numbers back to the binary format. For monotonically increasing functions, the scaling network is all we need, and it essentially uses only the routing resources and flip-flops of the FPGA architecture. Our method is especially well suited to FPGAs due to the abundant availability of routing and FF resources, and to the ability of FPGAs to realize high-fanout gates for highly oscillating functions. We compare our method to stochastic computing and to conventional binary implementations on a number of functions, as well as on two common image processing applications. Our method is clearly superior to the conventional binary implementation: our area×delay cost is on average only 3%, 8%, and 32% of the binary method for 8-, 10-, and 12-bit resolutions respectively. Compared to stochastic computing, our cost is 6%, 5%, and 8% for those resolutions. The area cost includes conversions from and to the binary format. Our method outperforms the conventional binary method on an edge detection algorithm. However, it is not competitive with the binary method on the median filtering application, due to the high cost of generating and saving unary representations of the input pixels.
DOI: https://doi.org/10.1145/3174243.3174267
Citations: 18
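The core trick for monotone functions, that the "scaling network" is pure routing, can be modeled in a few lines of Python. This is an illustrative sketch, not the paper's circuit; the rounding convention used to quantize f is an assumption. Because the unary output of a monotonically increasing f is itself a thermometer code, each output bit is simply a wire from one precomputed input bit.

```python
def thermometer(m, n):
    """Unary encoding of m/n: m ones followed by n-m zeros."""
    return [1] * m + [0] * (n - m)

def scaling_network(f, n):
    """Precompute which input bit drives each output bit. Output bit k
    must be 1 iff the quantized f(m/n) exceeds k, i.e. iff m exceeds a
    fixed threshold -- so bit k is a wire from one input bit."""
    wires = []
    for k in range(n):
        m = next((m for m in range(1, n + 1)
                  if round(f(m / n) * n) > k), None)  # None -> constant 0
        wires.append(m - 1 if m is not None else None)
    return wires

def route(bits, wires):
    """'Evaluate' f using routing only: no arithmetic, just rewiring."""
    return [bits[w] if w is not None else 0 for w in wires]
```

For example, with `f(v) = v*v` and n = 8, routing the thermometer code of m/8 through the precomputed wires yields the thermometer code of the quantized square, using no logic gates at all.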
A Full-System VM-HDL Co-Simulation Framework for Servers with PCIe-Connected FPGAs
Shenghsun Cho, Mrunal Patel, Han Chen, M. Ferdman, Peter Milder
The need for high-performance and low-power acceleration technologies in servers is driving the adoption of PCIe-connected FPGAs in datacenter environments. However, the co-development of the application software, driver, and hardware HDL for server FPGA platforms remains one of the fundamental challenges standing in the way of wide-scale adoption. The FPGA accelerator development process is plagued by a lack of comprehensive full-system simulation tools, unacceptably slow debug iteration times, and limited visibility into the software and hardware at the time of failure. In this work, we develop a framework that pairs a virtual machine and an HDL simulator to enable full-system co-simulation of a server system with a PCIe-connected FPGA. Our framework enables rapid development and debugging of unmodified application software, operating system, device drivers, and hardware design. Once debugged, neither the software nor the hardware requires any changes before being deployed in a production environment. In our case studies, we find that the co-simulation framework greatly improves debug iteration time while providing invaluable visibility into both the software and hardware components.
DOI: https://doi.org/10.1145/3174243.3174269
Citations: 9
FGC: A Tool-flow for Generating and Configuring Custom FPGAs (Abstract Only)
Oluseyi A. Ayorinde, He Qi, B. Calhoun
We introduce the FGC Toolflow, the only tool to date providing flexible custom-FPGA generation and configuration. Currently, researchers building custom FPGAs must create FPGA schematics and bitstreams by hand. Both tasks are prohibitively time-intensive and error-prone. Additionally, the simulation time for bitcell configuration is very long (often many times longer than for the functionality itself), making the verification of FPGA fabrics even more time-consuming. Some existing toolflows and software packages are designed to help with this process, but they only generate bitcell configurations, leaving schematics to be developed by hand. Others are limited in their circuit-level and architectural parameters, which prevents them from adequately exploring the FPGA design space. The FGC flow is the only flow available that generates a custom full-FPGA schematic from a single parameter text file and generates the proper configuration bitstream for a target Verilog functionality. The parameter text file can accommodate hundreds of different parameters, including both circuit-level and architectural parameters, to fully cover the FPGA design space. The FGC flow generates both a schematic and a configuration bitstream for an FPGA with 100 CLBs (900,000 transistors) in only 8 minutes. The flow also generates simulation files, allowing the user to quickly set up and perform simulations that verify the FPGA and its configuration at the chip level with SPICE-level accuracy. This flow was used to create, verify, and test a taped-out ultra-low-power FPGA.
DOI: https://doi.org/10.1145/3174243.3174997
Citations: 0
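As a hypothetical illustration of parameter-driven generation (the parameter names and the Verilog template below are invented for this sketch and are not the actual FGC parameter format), a generator that turns one parameter dictionary into a structural netlist might look like:

```python
# Illustrative only: "num_clbs", "lut_size", and "channel_width" are
# made-up parameter names standing in for FGC's parameter text file.
params = {"num_clbs": 100, "lut_size": 4, "channel_width": 32}

def generate_top(p):
    """Emit a top-level Verilog module whose shape (CLB count, LUT
    size, channel width) is driven entirely by the parameter dict."""
    return "\n".join([
        f"module fpga_top #(parameter CHAN_W = {p['channel_width']}) ();",
        *(f"  clb #(.LUT_K({p['lut_size']})) clb_{i} ();"
          for i in range(p["num_clbs"])),
        "endmodule",
    ])
```

The design point is that every structural decision traces back to one text file, so regenerating a different fabric means editing parameters, not schematics.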
FPGA Fastfood - A High Speed Systolic Implementation of a Large Scale Online Kernel Method
Sean Fox, D. Boland, P. Leong
In this paper, we describe a systolic Field Programmable Gate Array (FPGA) implementation of the Fastfood algorithm that is optimised to run at a high frequency. The Fastfood algorithm supports online learning for large-scale kernel methods. Empirical results show that 500 MHz clock rates can be sustained by an architecture that can solve problems with input dimensions 10^3 times larger than previously reported. Unlike many recent deep learning publications, this design implements both training and prediction. This enables the use of kernel methods in applications requiring a rare combination of capacity, adaptation, and speed.
DOI: https://doi.org/10.1145/3174243.3174271
Citations: 3
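A software model of one Fastfood block helps show why it suits a systolic design: the dense Gaussian random projection of standard random kitchen sinks is replaced by V·x = S·H·G·Π·H·B·x, where B is a random sign flip, H a Walsh-Hadamard transform, Π a permutation, and G a diagonal Gaussian, costing O(n log n) instead of O(n²). This sketch simplifies the final scaling S to 1/sqrt(n); the paper's hardware pipelines these stages, not this Python.

```python
import math
import random

def fwht(a):
    """In-place fast Walsh-Hadamard transform (len(a) a power of 2)."""
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def fastfood_features(x, seed=0):
    """One Fastfood block: S*H*G*Pi*H*B applied to x, in O(n log n)."""
    n = len(x)
    rng = random.Random(seed)
    B = [rng.choice([-1.0, 1.0]) for _ in range(n)]   # random signs
    Pi = list(range(n)); rng.shuffle(Pi)              # random permutation
    G = [rng.gauss(0.0, 1.0) for _ in range(n)]       # Gaussian diagonal
    v = fwht([b * xi for b, xi in zip(B, x)])         # H * B * x
    v = [v[p] for p in Pi]                            # Pi * ...
    v = fwht([g * vi for g, vi in zip(G, v)])         # H * G * ...
    s = 1.0 / math.sqrt(n)                            # simplified scaling S
    return [s * vi for vi in v]
```

Because every stage is a fixed butterfly, permutation, or elementwise multiply, the whole block maps naturally onto a fixed-latency systolic pipeline.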
Towards Serial-Equivalent Parallel Routing for FPGAs: (Abstract Only)
Minghua Shen, Wentai Zhang, Nong Xiao, Guojie Luo
Serial equivalency can provide easier regression testing and customer support in production-grade CAD software. While existing parallel routing techniques have become sufficiently advanced to accelerate execution time, support for serial equivalency has been very limited or ignored because it was considered costly. In this paper, we propose serial-equivalent parallel routing for FPGAs. We use optimal dependency-aware scheduling to achieve serial equivalency in the parallel routing algorithm. This guarantees the same answer as the serial version of the algorithm, regardless of how many processing cores are used. We also validate this property across different hardware platforms. Further experimental results show that we achieve a 14.27x speedup on an MPI-based distributed parallel computer and a 19.65x speedup on a GPU-based massively parallel machine. To our knowledge, this is the first parallel routing with a serial equivalency guarantee.
{"title":"Towards Serial-Equivalent Parallel Routing for FPGAs: (Abstract Only)","authors":"Minghua Shen, Wentai Zhang, Nong Xiao, Guojie Luo","doi":"10.1145/3174243.3174974","DOIUrl":"https://doi.org/10.1145/3174243.3174974","url":null,"abstract":"Serial equivalency can provide easier regression testing and customer support in production-grade CAD software. While existing parallel routing techniques have become sufficiently advanced to accelerate the execution time, support for serial equivalency has been very limited or ignored due to it was considered costly. In this paper, we propose serial-equivalent parallel routing for FPGAs. We use an optimal dependency-aware scheduling to facilitate serial equivalency of parallel routing algorithm. This capability enables the same answer as the serial version of the parallel algorithm, regardless of how many processing cores are used. We also validate this property across different hardware platforms. Further experimental results show that we achieve a 14.27x speedup on the MPI-based distributed parallel computer and a 19.65x speedup on the GPU-based massively parallel machine. To our knowledge, it is the first parallel routing with a serial equivalency guarantee.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121055740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
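One way to obtain serial equivalency is to derive, before routing, dependency levels among the nets: nets in the same level do not conflict and can be routed concurrently, while levels are processed in serial order, so the merged result matches the one-core run exactly. A small Python sketch of such dependency-aware level scheduling; the conflict relation and net representation are illustrative assumptions, not the authors' scheduler:

```python
from collections import defaultdict

def dependency_levels(nets, conflicts):
    """Group nets into parallel levels. A net's level is one more than the
    highest level of any earlier (lower-indexed) net it conflicts with, so
    routing levels in order reproduces the serial first-come order of results."""
    level = {}
    for i in range(len(nets)):
        deps = [level[j] for j in range(i)
                if (j, i) in conflicts or (i, j) in conflicts]
        level[i] = max(deps, default=-1) + 1
    buckets = defaultdict(list)
    for i, lv in level.items():
        buckets[lv].append(nets[i])
    return [buckets[lv] for lv in sorted(buckets)]
```

Because the level assignment depends only on the (deterministic) conflict relation and net order, the schedule — and hence the routing outcome — is the same whether one core or many cores execute each level.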
SIFT Keypoint Descriptor Matching Algorithm: A Fully Pipelined Accelerator on FPGA (Abstract Only)
Luka Daoud, M. K. Latif, N. Rafla
The Scale Invariant Feature Transform (SIFT) is one of the classical feature extraction algorithms in Computer Vision. It consists of two stages: keypoint descriptor extraction and descriptor matching. SIFT descriptor matching is a computationally intensive process. In this work, we present the design and implementation of a hardware core accelerator for the descriptor-matching algorithm on a field programmable gate array (FPGA). Our proposed hardware core architecture copes with the available memory bandwidth and reaches the roofline performance bound to achieve maximum throughput. The matching core was implemented using the Xilinx Vivado® EDA design suite on a Zynq®-based FPGA development board. The proposed matching-core architecture is fully pipelined for 16-bit fixed-point operations and consists of five main submodules designed in Verilog, High-Level Synthesis, and System Generator. Area resources were significantly reduced compared to the most recent matching core implemented in hardware. Our matching core detects 98% of the matching points found by the software approach while running 15.7× faster.
{"title":"SIFT Keypoint Descriptor Matching Algorithm: A Fully Pipelined Accelerator on FPGA(Abstract Only)","authors":"Luka Daoud, M. K. Latif, N. Rafla","doi":"10.1145/3174243.3174994","DOIUrl":"https://doi.org/10.1145/3174243.3174994","url":null,"abstract":"Scale Invariant Feature Transform (SIFT) algorithm is one of the classical feature extraction algorithms that is well known in Computer Vision. It consists of two stages: keypoint descriptor extraction and descriptor matching. SIFT descriptor matching algorithm is a computational intensive process. In this work, we present a design and implementation of a hardware core accelerator for the descriptor-matching algorithm on a field programmable gate array (FPGA). Our proposed hardware core architecture is able to cope with the memory bandwidth and hit the roofline performance model to achieve maximum throughput. The matching-core was implemented using Xilinx Vivado® EDA design suite on a Zynq®-based FPGA Development board. The proposed matching-core architecture is fully pipelined for 16-bit fixed-point operations and consists of five main submodules designed in Verilog, High Level Synthesis, and System Generator. The area resources were significantly reduced compared to the most recent matching-core implemented on hardware. 
While our proposed hardware accelerator matching-core was able to detect 98% matching-points compared to the software approach, it is 15.7 × faster.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124154467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
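Descriptor matching conventionally finds, for each query descriptor, its two nearest database descriptors and accepts a match only when the best distance is clearly smaller than the second best (Lowe's ratio test). A brute-force NumPy sketch of that step; the 0.8 threshold and function name are conventional assumptions, and this is a software reference, not the paper's 16-bit fixed-point pipeline:

```python
import numpy as np

def match_descriptors(query, train, ratio=0.8):
    """Return (query_idx, train_idx) pairs passing Lowe's ratio test.
    Works on squared L2 distances, avoiding the square root much as
    fixed-point hardware designs do."""
    matches = []
    for qi, q in enumerate(query):
        d2 = np.sum((train - q) ** 2, axis=1)     # squared distance to every train descriptor
        order = np.argsort(d2)
        best, second = order[0], order[1]
        if d2[best] < (ratio ** 2) * d2[second]:  # ratio test on squared distances
            matches.append((qi, int(best)))
    return matches
```

A pipelined accelerator replaces the Python loop with one distance computation per clock, but the accept/reject decision per keypoint is exactly this comparison.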
Graph-Theoretically Optimal Memory Banking for Stencil-Based Computing Kernels
Juan Escobedo, Mingjie Lin
High-Level Synthesis (HLS) has advanced significantly in compiling high-level "soft" programs into efficient register-transfer level (RTL) "hard" specifications. However, manually rewriting C-like code is still often required to effectively optimize the access performance of synthesized memory subsystems. As such, extensive research has gone into automated memory optimization techniques, among which memory banking is a key technique for improving access performance. Several key questions remain, however: given a stencil-based computing kernel, what constitutes an optimal memory banking scheme that minimizes the number of banks required for conflict-free accesses? If such an optimal scheme exists, how can an FPGA designer determine it automatically? Finally, does every stencil-based kernel admit an optimal banking scheme? In this paper we solve the memory banking problem optimally for stencil-based computing kernels using well-known theorems in graph theory.
{"title":"Graph-Theoretically Optimal Memory Banking for Stencil-Based Computing Kernels","authors":"Juan Escobedo, Mingjie Lin","doi":"10.1145/3174243.3174251","DOIUrl":"https://doi.org/10.1145/3174243.3174251","url":null,"abstract":"High-Level Synthesis (HLS) has advanced significantly in compiling high-level \"soft»» programs into efficient register-transfer level (RTL) \"hard»» specifications. However, manually rewriting C-like code is still often required in order to effectively optimize the access performance of synthesized memory subsystems. As such, extensive research has been performed on developing and implementing automated memory optimization techniques, among which memory banking has been a key technique for access performance improvement. However, several key questions remain to be answered: given a stencil-based computing kernel, what constitutes an optimal memory banking scheme that minimizes the number of memory banks required for conflict-free accesses? Furthermore, if such an optimal memory banking scheme exists, how can an FPGA designer automatically determine it? Finally, does any stencil-based kernel have the optimal banking scheme? In this paper we attempt to optimally solve memory banking problem for synthesizing stencil-based computing kernels with well-known theorems in graph theory. 
Our graph-based methodology not only computes the minimum memory partition factor for any given stencil, but also exploits the repeatability of coloring entire memory access conflict graph, which significantly improves hardware efficiency.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"153 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130637483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
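Conflict-free banking requires that all data points a stencil touches in one iteration land in distinct banks. Under the common cyclic mapping bank(i) = i mod N, this holds for every iteration exactly when the stencil's offsets are pairwise distinct modulo N, so the minimum N can be found by search. A small Python sketch of that check; the 1-D stencil and cyclic mapping are illustrative assumptions, simpler than the paper's graph-coloring construction, which can do better than cyclic schemes:

```python
def min_cyclic_banks(offsets):
    """Smallest bank count N for bank(i) = i mod N such that the stencil's
    offsets map to pairwise distinct banks. Then for every iteration t the
    simultaneous accesses {t + o : o in offsets} hit different banks."""
    n = len(offsets)
    N = n  # at least one bank per concurrent access
    while len({o % N for o in offsets}) != n:
        N += 1
    return N
```

Note that the minimum is not always the stencil size: offsets [0, 1, 3] collide modulo 3 (0 and 3 share a bank), so a cyclic scheme needs four banks for three concurrent accesses — precisely the kind of gap a graph-theoretic formulation closes.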
Journal: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays