
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter (Abstract Only)
A. Abdelsalam, J. Langlois, F. Cheriet
Implementing an accurate and fast activation function at low cost is a crucial aspect of implementing Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs, based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. The proposed implementation outperforms existing implementations in accuracy while using the same or fewer computational and memory resources. The proposed architecture can approximate the hyperbolic tangent activation function with a maximum error of 2×10⁻⁴ while requiring only 1.12 Kbits of memory and 21 LUTs of a Virtex-7 FPGA.
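The abstract does not spell out the DCTIF construction, but the core trade-off (a small table of stored samples plus cheap interpolation arithmetic) can be illustrated in a few lines. The sketch below is a simplified stand-in that uses linear interpolation rather than the paper's DCT interpolation filter; the sample count and input range are assumptions chosen for illustration.

```python
import numpy as np

# Assumed parameters: 256 uniform samples of tanh on [0, 8).
# tanh is odd and saturates, so only the positive half is tabulated.
N, X_MAX = 256, 8.0
STEP = X_MAX / N
table = np.tanh(np.arange(N + 1) * STEP)  # one guard sample for interpolation

def tanh_approx(x):
    """Table lookup plus linear interpolation; sign recovered from oddness."""
    s = np.sign(x)
    ax = np.minimum(np.abs(x), X_MAX - 1e-9)  # saturate beyond the table
    idx = (ax / STEP).astype(int)
    frac = ax / STEP - idx
    return s * (table[idx] * (1 - frac) + table[idx + 1] * frac)

xs = np.linspace(-12, 12, 200001)
print("max abs error:", np.max(np.abs(np.tanh(xs) - tanh_approx(xs))))
```

A DCT-based interpolation filter spends a few more multiply-adds per evaluation to extract a tighter error bound from the same or a smaller table, which is the accuracy-per-bit advantage the abstract claims.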
{"title":"Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter (Abstract Only)","authors":"A. Abdelsalam, J. Langlois, F. Cheriet","doi":"10.1145/3020078.3021768","DOIUrl":"https://doi.org/10.1145/3020078.3021768","url":null,"abstract":"Implementing an accurate and fast activation function with low cost is a crucial aspect to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines simple arithmetic operations on the stored samples of the hyperbolic tangent function and on input data. The proposed implementation outperforms the existing implementations in terms of accuracy while using the same or fewer computational and memory resources. The proposed architecture can approximate the hyperbolic tangent activation function with 2×10-4 maximum error while requiring only 1.12 Kbits memory and 21 LUTs of a Virtex-7 FPGA.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130302875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)
Yongming Shen, M. Ferdman, Peter Milder
Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators to improve their evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally-intensive convolutional layers, while a major bottleneck of the existing designs arises due to the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to the possibility of reducing CNN weight transfer bandwidth by adding more on-chip buffers, it is also possible to reduce the size of the on-chip buffers at the cost of CNN input transfer. Paradoxically, shrinking the size of the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected layer accelerators that require substantially less off-chip bandwidth by balancing between the input and weight transfers. Using 160KB of BRAM enables the prior work to reduce off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, using the same 160KB of BRAM produces a design with 71x bandwidth reduction on the same networks.
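As a back-of-the-envelope illustration of the trade-off described above, the sketch below models the off-chip traffic of one fully-connected layer under the baseline batching scheme (weights streamed once per batch of inputs). The layer and buffer sizes are assumptions loosely patterned on an AlexNet-style layer; the paper's improved method, which also re-transfers inputs to free buffer space for weights, is not modeled here.

```python
# Assumed sizes: a 9216 x 4096 fully-connected layer with 2-byte values.
W = 9216 * 4096 * 2   # weight bytes, streamed from off-chip once per batch
I = 9216 * 2          # input bytes per image
O = 4096 * 2          # output bytes per image
C = 160 * 1024        # on-chip buffer budget (160KB, as in the abstract)

def bytes_per_image(batch):
    """Off-chip traffic per image when `batch` inputs share one pass over
    the weights; each image's inputs and outputs move on/off chip once."""
    return W / batch + I + O

b = C // (I + O)  # largest batch whose inputs + output accumulators fit
print("batch size:", b)
print("batched bytes/image:  ", bytes_per_image(b))
print("unbatched bytes/image:", bytes_per_image(1))
```

Because the weight matrix dwarfs a single input vector, re-sending inputs is cheap compared with the weight bandwidth it can save, which is the imbalance the paper exploits.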
{"title":"Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)","authors":"Yongming Shen, M. Ferdman, Peter Milder","doi":"10.1145/3020078.3021795","DOIUrl":"https://doi.org/10.1145/3020078.3021795","url":null,"abstract":"Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators to improve their evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally-intensive convolutional layers, while a major bottleneck of the existing designs arises due to the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to the possibility of reducing CNN weight transfer bandwidth by adding more on-chip buffers, it is also possible to reduce the size of the on-chip buffers at the cost of CNN input transfer. Paradoxically, shrinking the size of the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected layer accelerators that require substantially less off-chip bandwidth by balancing between the input and weight transfers. Using 160KB of BRAM enables the prior work to reduce off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. With our newly proposed methodology, using the same 160KB of BRAM produces a design with 71x bandwidth reduction on the same networks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115872445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FPGAs in the Cloud
G. Constantinides
Ever greater amounts of computing and storage are moving to the cloud, and spending on public cloud services is estimated to grow by over 19% per year, reaching $140B in 2019. Besides commodity processors, network, and storage infrastructure, the end of clock frequency scaling in traditional processors has meant that application-specific accelerators are required in tandem with cloud-based processors to deliver continued improvements in computational performance and energy efficiency. Indeed, graphics processing units (GPUs), as well as custom ASICs, are now widely used within the cloud, particularly for compute-intensive, high-value applications like machine learning. In this panel, we intend to consider the opportunities and challenges for broad deployment of FPGAs in the cloud.
{"title":"FPGAs in the Cloud","authors":"G. Constantinides","doi":"10.1145/3020078.3030014","DOIUrl":"https://doi.org/10.1145/3020078.3030014","url":null,"abstract":"Ever greater amounts of computing and storage are happening remotely in the cloud, and it is estimated that spending on public cloud services will grow by over 19%/year to $140B in 2019. Besides commodity processors, network and storage infrastructure, the end of clock frequency scaling in traditional processors has meant that application-specific accelerators are required in tandem with cloud-based processors to deliver continued improvements in computational performance and energy efficiency. Indeed, graphics processing units (GPUs), as well as custom ASICs, are now widely used within the cloud, particularly for compute-intensive high-value applications like machine learning. In this panel, we intend to consider the opportunities and challenges for broad deployment of FPGAs in the cloud.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128112868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
FPGA Implementation of Non-Uniform DFT for Accelerating Wireless Channel Simulations (Abstract Only)
Srinivas Siripurapu, Aman Gayasen, P. Gopalakrishnan, N. Chandrachoodan
FPGAs have been used as accelerators in a wide variety of domains such as learning, search, genomics, signal processing, compression, and analytics. In recent years, the availability of tools and flows such as high-level synthesis has made it even easier to accelerate a variety of high-performance computing applications on FPGAs. In this paper we propose a systematic methodology for optimizing the performance of an accelerated block, using the notion of compute intensity to guide optimizations in high-level synthesis. We demonstrate the effectiveness of our methodology on an FPGA implementation of a non-uniform discrete Fourier transform (NUDFT), used to convert a wireless channel model from the time domain to the frequency domain. Accelerating this particular computation improves the performance and capacity of wireless channel simulation, which has wide applications in the system-level design and performance evaluation of wireless networks. Our results show that our FPGA implementation outperforms the same code offloaded onto GPUs and CPUs by 1.6x and 10x respectively, as measured by the throughput of the accelerated block. The gains in performance per watt versus GPUs and CPUs are 15.6x and 41.5x respectively.
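For reference, the NUDFT itself is a dense matrix-vector product: every output frequency accumulates contributions from every non-uniform sample, and all outputs are independent. That structure is what makes it costly on a CPU and a natural fit for a deeply pipelined FPGA datapath. A minimal numpy sketch with arbitrary sizes:

```python
import numpy as np

def nudft(x, t, f):
    """Direct non-uniform DFT: samples x taken at arbitrary times t,
    evaluated at arbitrary frequencies f. O(len(f) * len(t)) independent
    multiply-accumulates."""
    return np.exp(-2j * np.pi * np.outer(f, t)) @ x

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, 256))            # non-uniform sample times
x = rng.standard_normal(256) + 1j * rng.standard_normal(256)
f = np.linspace(-128.0, 127.0, 256)                # target frequency grid
print(nudft(x, t, f).shape)                        # (256,)
```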
{"title":"FPGA Implementation of Non-Uniform DFT for Accelerating Wireless Channel Simulations (Abstract Only)","authors":"Srinivas Siripurapu, Aman Gayasen, P. Gopalakrishnan, N. Chandrachoodan","doi":"10.1145/3020078.3021800","DOIUrl":"https://doi.org/10.1145/3020078.3021800","url":null,"abstract":"FPGAs have been used as accelerators in a wide variety of domains such as learning, search, genomics, signal processing, compression, analytics and so on. In recent years, the availability of tools and flows such as high-level synthesis has made it even easier to accelerate a variety of high-performance computing applications onto FPGAs. In this paper we propose a systematic methodology for optimizing the performance of an accelerated block using the notion of compute intensity to guide optimizations in high-level synthesis. We demonstrate the effectiveness of our methodology on an FPGA implementation of a non-uniform discrete Fourier transform (NUDFT), used to convert a wireless channel model from the time-domain to the frequency domain. The acceleration of this particular computation can be used to improve the performance and capacity of wireless channel simulation, which has wide applications in the system level design and performance evaluation of wireless networks. Our results show that our FPGA implementation outperforms the same code offloaded onto GPUs and CPUs by 1.6x and 10x respectively, in performance as measured by the throughput of the accelerated block. The gains in performance per watt versus GPUs and CPUs are 15.6x and 41.5x respectively.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133560705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Stochastic-Based Multi-stage Streaming Realization of a Deep Convolutional Neural Network (Abstract Only)
Mohammed Alawad, Mingjie Lin
The large-scale convolutional neural network (CNN), conceptually mimicking the operational principle of visual perception in the human brain, has been widely applied to tackle many challenging computer vision and artificial intelligence applications. Unfortunately, despite its simple architecture, a typically sized CNN is well known to be computationally intensive. This work presents a novel stochastic-based, scalable hardware architecture and circuit design that computes a large-scale CNN on an FPGA. The key idea is to implement all key components of a deep learning CNN, including multi-dimensional convolution, activation, and pooling layers, completely in the probabilistic computing domain in order to achieve high computing robustness, high performance, and low hardware usage. Most importantly, through both theoretical analysis and FPGA hardware implementation, we demonstrate that a stochastic-based deep CNN can achieve superior hardware scalability compared with a conventional deterministic-based FPGA implementation, by allowing a stream computing mode and adopting efficient random sample manipulations. Overall, being highly scalable and energy efficient, our stochastic-based convolutional neural network architecture is well suited for a modular vision engine aimed at real-time detection, recognition, and segmentation of mega-pixel images, especially perception-based computing tasks that are inherently fault-tolerant while still requiring high energy efficiency.
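The abstract leaves the stochastic arithmetic implicit. In stochastic computing, a value in [-1, 1] is encoded by the statistics of a random bitstream, and multiplication collapses to one XNOR gate per bit pair; longer streams trade latency for lower estimator variance. A minimal sketch of that bipolar encoding (the stream length is an assumed accuracy knob, not a figure from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def to_stream(v, n):
    """Bipolar stochastic encoding: v in [-1, 1] becomes an n-bit random
    stream whose probability of a 1 is (v + 1) / 2."""
    return rng.random(n) < (v + 1) / 2

def from_stream(s):
    """Decode a bipolar stream back to a value in [-1, 1]."""
    return 2 * s.mean() - 1

def sc_mul(a, b, n=4096):
    """Bipolar stochastic multiply: elementwise XNOR of the two streams."""
    return from_stream(~(to_stream(a, n) ^ to_stream(b, n)))

print(sc_mul(0.5, -0.4), "vs exact", 0.5 * -0.4)
```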
{"title":"Stochastic-Based Multi-stage Streaming Realization of a Deep Convolutional Neural Network (Abstract Only)","authors":"Mohammed Alawad, Mingjie Lin","doi":"10.1145/3020078.3021788","DOIUrl":"https://doi.org/10.1145/3020078.3021788","url":null,"abstract":"Large-scale convolutional neural network (CNN), conceptually mimicking the operational principle of visual perception in human brain, has been widely applied to tackle many challenging computer vision and artificial intelligence applications. Unfortunately, despite of its simple architecture, a typically sized CNN is well known to be computationally intensive. This work presents a novel stochastic-based and scalable hardware architecture and circuit design that computes a large-scale CNN with FPGA. The key idea is to implement all key components of a deep learning CNN, including multi-dimensional convolution, activation, and pooling layers, completely in the probabilistic computing domain in order to achieve high computing robustness, high performance, and low hardware usage. Most importantly, through both theoretical analysis and FPGA hardware implementation, we demonstrate that stochastic-based deep CNN can achieve superior hardware scalability when compared with its conventional deterministic-based FPGA implementation by allowing a stream computing mode and adopting efficient random sample manipulations. Overall, being highly scalable and energy efficient, our stochastic-based convolutional neural network architecture is well-suited for a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images, especially those perception-based computing tasks that are inherently fault-tolerant, while still requiring high energy efficiency.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131456256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Session details: CAD Tools
Lesley Shannon
{"title":"Session details: CAD Tools","authors":"Lesley Shannon","doi":"10.1145/3257187","DOIUrl":"https://doi.org/10.1145/3257187","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":" 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113949246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Jialiang Zhang, J. Li
OpenCL for FPGAs has recently gained great popularity, driven by emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), the most popular deep learning architecture in computer vision. While OpenCL enhances the code portability and programmability of FPGAs, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources of the FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for compute- and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design that effectively addresses this bandwidth limitation and provides an optimal balance between computation, on-chip, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
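The analytical model itself is not given in the abstract; a roofline-style stand-in captures the bottleneck it identifies: attainable throughput is capped either by the compute roof or by on-chip memory bandwidth times the kernel's arithmetic intensity. All numbers below are placeholders for illustration, not the paper's measurements.

```python
def attainable_gops(peak_gops, onchip_bw_gbs, ops_per_byte):
    """Roofline-style bound: min of the compute roof and the bandwidth
    roof (on-chip bytes/s times ops per on-chip byte)."""
    return min(peak_gops, onchip_bw_gbs * ops_per_byte)

peak = 1366.0   # placeholder GOP/s if every MAC unit fired each cycle
bw = 400.0      # placeholder GB/s of usable on-chip (BRAM) bandwidth
for intensity in (0.5, 1, 2, 4, 8):   # ops per byte fetched on chip
    print(intensity, "->", attainable_gops(peak, bw, intensity), "GOP/s")
```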
{"title":"Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network","authors":"Jialiang Zhang, J. Li","doi":"10.1145/3020078.3021698","DOIUrl":"https://doi.org/10.1145/3020078.3021698","url":null,"abstract":"OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis on the resource requirement of CNN classifier kernels and available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address such bandwidth limitation and to provide an optimal balance between computation, on-chip, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating point performance at 370MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115281475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 194
An FPGA Overlay Architecture for Cost Effective Regular Expression Search (Abstract Only)
Thomas Luinaud, Y. Savaria, J. Langlois
Snort and Bro are Deep Packet Inspection systems that express complex rules with regular expressions. Before performing a regular expression search, these applications apply a filter to select which regular expressions must be searched. One way to search a regular expression is with a Nondeterministic Finite Automaton (NFA), but traversing an NFA is very time consuming on a sequential machine such as a CPU. One solution is to implement the NFA in hardware; since FPGAs are reconfigurable and massively parallel, they are a good fit. Moreover, with the advent of platforms combining FPGAs and CPUs, implementing accelerators on FPGAs becomes very attractive. Even though FPGAs are reconfigurable, the reconfiguration time can be too long in some cases. This paper therefore proposes an overlay architecture that can efficiently find matches for regular expressions. The architecture contains multiple contexts that allow fast reconfiguration: based on the results of a string filter, a context is selected and the regular expression search is performed. The proposed design can support all rules from a set such as Snort's while significantly reducing compute resources and allowing fast context updates. An example architecture was implemented on a Xilinx® xc7a200 Artix-7. It achieves a throughput of 100 million characters per second, requires 20 ns for a context switch, and occupies 9% of the slices and 85% of the BRAM resources of the FPGA.
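For background on why hardware helps: on a CPU, each input character may require visiting every currently active NFA state, while an FPGA updates all states at once, so a character costs one cycle regardless of how many states are live. A small Python model of the active-set traversal (the example automaton is made up):

```python
def run_nfa(transitions, start, accept, text):
    """Software model of NFA traversal. `transitions` maps
    (state, char) -> set of next states; hardware evaluates all of these
    transitions for one character simultaneously."""
    active = {start}
    for ch in text:
        active = set().union(*(transitions.get((s, ch), set()) for s in active))
        if not active:
            return False
    return bool(active & accept)

# Tiny NFA for the regular expression a(b|c)*d:
t = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {1}, (1, "d"): {2}}
print(run_nfa(t, 0, {2}, "abccbd"))  # True
print(run_nfa(t, 0, {2}, "abx"))     # False: 'x' kills all active states
```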
{"title":"An FPGA Overlay Architecture for Cost Effective Regular Expression Search (Abstract Only)","authors":"Thomas Luinaud, Y. Savaria, J. Langlois","doi":"10.1145/3020078.3021770","DOIUrl":"https://doi.org/10.1145/3020078.3021770","url":null,"abstract":"Snort and Bro are Deep Packet Inspection systems which express complex rules with regular expressions. Before performing a regular expression search, these applications apply a filter to select which regular expressions must be searched. One way to search a regular expression is through a Nondeterministic Finite Automaton (NFA). Traversing an NFA is very time consuming on a sequential machine like a CPU. One solution so is to implement the NFA into hardware. Since FPGAs are reconfigurable and are massively parallel they are a good solution. Moreover, with the advent of platforms combining FPGAs and CPUs, implementing accelerators into FPGA becomes very interesting. Even though FPGAs are reconfigurable, the reconfiguration time can be too long in some cases. This paper thus proposes an overlay architecture that can efficiently find matches for regular expressions. The architecture contains multiple contexts that allow fast reconfiguration. Based on the results of a string filter, a context is selected and regular expression search is performed. The proposed design can support all rules from a set such as Snort while significantly reducing compute resources and allowing fast context updates. An example architecture was implemented on a Xilinx® xc7a200 Artix-7. It achieves a throughput of 100 million characters per second, requires 20 ns for a context switch, and occupies 9% of the slices and 85% of the BRAM resources of the FPGA.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"140 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116275014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, M. Srivastava, Rajesh K. Gupta, Zhiru Zhang
Convolutional neural networks (CNNs) are the current state of the art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies of the FPGA acceleration of CNN workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs, i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as in energy and resource efficiency.
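The kernel operation behind BNN accelerators is compact enough to show directly: with weights and activations constrained to +/-1 and packed as bits, a dot product reduces to an XNOR (equivalently, an XOR with a correction) followed by a popcount. A minimal sketch using Python integers as bit vectors:

```python
def bin_dot(a, w, n):
    """Dot product of two n-element +/-1 vectors packed into integers
    (bit 1 encodes +1, bit 0 encodes -1): matches minus mismatches,
    i.e. n - 2 * popcount(a XOR w)."""
    return n - 2 * bin(a ^ w).count("1")

# 8-element example: 6 matching bit positions, 2 differing -> 6 - 2 = 4.
print(bin_dot(0b10110010, 0b10010011, 8))  # 4
```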
{"title":"Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs","authors":"Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, M. Srivastava, Rajesh K. Gupta, Zhiru Zhang","doi":"10.1145/3020078.3021741","DOIUrl":"https://doi.org/10.1145/3020078.3021741","url":null,"abstract":"Convolutional neural networks (CNN) are the current stateof-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads has achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks give GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129833380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 365
A Mixed-Signal Data-Centric Reconfigurable Architecture enabled by RRAM Technology (Abstract Only)
Yue Zha, Jialiang Zhang, Zhiqiang Wei, J. Li
This poster presents a data-centric reconfigurable architecture enabled by emerging non-volatile memory, i.e., RRAM. In contrast to the heterogeneous architecture of commercial FPGAs, it is inherently a homogeneous architecture comprising a two-dimensional (2D) array of mixed-signal processing "tiles". Each tile can be configured into one, or a combination, of four modes: logic, memory, TCAM, and interconnect. Computation within a tile is performed in the analog domain for energy efficiency, whereas communication between tiles is performed in the digital domain for resilience. Such flexibility allows users to partition resources based on an application's needs, in contrast to fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its "memory friendly" architecture effectively addresses a key limitation of commercial FPGAs, namely scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained configuration results in shallower logic depth and less inter-tile routing overhead, and thus smaller area and better performance, than its FPGA counterpart. Our preliminary study shows great promise for this architecture in improving performance, energy efficiency, and security.
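As a toy model of the homogeneous-fabric idea, the sketch below partitions a small tile array into the four modes named in the abstract. The fabric size and the particular partition are invented purely for illustration.

```python
from collections import Counter
from enum import Enum

class Mode(Enum):
    LOGIC = "logic"
    MEMORY = "memory"
    TCAM = "tcam"
    INTERCONNECT = "interconnect"

# Hypothetical 4x4 fabric, repartitioned per application rather than
# fixed at fabrication time as in a heterogeneous FPGA.
fabric = [[Mode.LOGIC] * 4 for _ in range(4)]
fabric[0] = [Mode.MEMORY] * 4                    # one row as on-chip buffer
fabric[1][0] = Mode.TCAM                         # a single lookup tile
fabric[2][1] = fabric[2][2] = Mode.INTERCONNECT  # extra routing tiles
print(Counter(m.value for row in fabric for m in row))
```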
{"title":"A Mixed-Signal Data-Centric Reconfigurable Architecture enabled by RRAM Technology (Abstract Only)","authors":"Yue Zha, Jialiang Zhang, Zhiqiang Wei, J. Li","doi":"10.1145/3020078.3021759","DOIUrl":"https://doi.org/10.1145/3020078.3021759","url":null,"abstract":"This poster presents a data-centric reconfigurable architecture, which is enabled by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture of commercial FPGAs, it is inherently a homogeneous architecture comprising of a two-dimensional (2D) array of mixed-signal processing \"tiles\". Each tile can be configured into one or a combination of the four modes: logic, memory, TCAM, and interconnect. Computation within a tile is performed in analog domain for energy efficiency, whereas communication between tiles is performed in digital domain for resilience. Such flexibility allows users to partition resources based on applications' needs, in contrast to fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its \"memory friendly\" architecture effectively addressed the limitations of commercial FPGAs i.e., scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained configuration results in shallower logic depth, less inter-tile routing overhead, and thus smaller area and better performance, compared with its FPGA counter part. Our preliminary study shows great promise of this architecture for improving performance, energy efficiency and security.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126507626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0