2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications
An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs
Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, Yun Liang
Deep convolutional neural networks (CNNs) have achieved remarkable performance at the cost of huge computation. As CNN models become more complex and deeper, compressing CNNs to sparse form by pruning redundant connections has emerged as an attractive approach to reduce computation and memory requirements. In recent years, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. However, most existing FPGA architectures focus on dense CNN models. Architectures designed for dense CNN models are inefficient when executing sparse models, as most of the arithmetic operations involve addition and multiplication with zero operands. On the other hand, recent sparse FPGA accelerators focus only on FC layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently deal with the irregular connections in the sparse convolutional layer, we propose a weight-oriented dataflow that processes each weight individually. We then design an FPGA architecture that can handle input-weight and weight-output connections efficiently. For input-weight connections, we design a tile look-up table to eliminate the runtime index matching of compressed weights. Moreover, we develop a weight layout to enable high on-chip memory access. To cooperate with the weight layout, a channel multiplexer is inserted to locate the address, ensuring no data-access conflicts. Experiments demonstrate that our accelerator achieves 223.4-309.0 GOP/s for modern CNNs on a Xilinx ZCU102, a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.
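The weight-oriented dataflow can be illustrated in software. Below is a minimal NumPy sketch, assuming a hypothetical `(value, c_in, c_out, r, s)` coordinate list for the non-zero weights; it shows only the dataflow idea, not the paper's FPGA architecture, tile look-up table, or weight layout.

```python
import numpy as np

def sparse_conv2d(x, nz_weights, out_shape):
    """Weight-oriented sparse convolution: visit each non-zero weight
    individually and accumulate its contribution to the whole output
    channel ('valid' convolution, stride 1).

    x: input activations, shape (C_in, H, W)
    nz_weights: list of (value, c_in, c_out, r, s) non-zero entries
    out_shape: (C_out, H_out, W_out)
    """
    y = np.zeros(out_shape, dtype=x.dtype)
    _, h_out, w_out = out_shape
    for v, c_in, c_out, r, s in nz_weights:
        # one weight multiplies an entire shifted window of its input channel
        y[c_out] += v * x[c_in, r:r + h_out, s:s + w_out]
    return y
```

Because zero weights never appear in `nz_weights`, no multiply-accumulate with a zero operand is ever issued, which is exactly the waste the abstract attributes to dense architectures running sparse models.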
Citations: 88
A High Throughput and Energy-Efficient Retina-Inspired Tone Mapping Processor
Lili Liu, Xiaoqiang Xiang, Yuxiang Xie, Yongjie Li, Bo Yan, Jun Zhou
This paper presents a high-throughput and energy-efficient retina-inspired tone mapping processor. Several hardware design techniques are proposed to achieve high throughput and energy efficiency, including data-partition-based parallel processing with S-shape sliding, adjacent-frame feature sharing, multi-layer convolution pipelining, and convolution filter compression with zero-skipping convolution. The proposed processor has been implemented on a Xilinx Virtex-7 FPGA for demonstration. It achieves a throughput of 189 frames per second for 1024*768 RGB images at 819 mW. Compared with several state-of-the-art tone mapping processors, the proposed processor achieves higher throughput and energy efficiency. It is suitable for high-speed, energy-constrained video enhancement applications such as autonomous vehicles and drone monitoring.
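Zero-skipping convolution, one of the techniques listed above, can be sketched in a few lines. This is an illustrative software model only (the paper applies the idea in hardware together with filter compression); the 1-D correlation form and names are my own.

```python
import numpy as np

def conv1d_zero_skipping(x, w):
    """Multiply-accumulate only for the non-zero filter taps, skipping
    zero coefficients entirely (sliding correlation, 'valid' range)."""
    nz = [(i, wi) for i, wi in enumerate(w) if wi != 0]  # compressed filter
    n_out = len(x) - len(w) + 1
    y = np.zeros(n_out)
    for j in range(n_out):
        for i, wi in nz:  # zero taps cost nothing here
            y[j] += wi * x[j + i]
    return y
```

With a filter that is half zeros, the inner loop issues half the MACs of the dense version while producing identical results; the output matches `np.correlate(x, w, mode="valid")`.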
Citations: 1
Automated Tool and Runtime Support for Fine-Grain Reconfiguration in Highly Flexible Reconfigurable Systems
Rafael Zamacola, A. García-Martínez, J. Mora, A. Otero, E. D. L. Torre
Dynamic partial reconfiguration significantly reduces reconfiguration times when offloading a partial design. However, there are occasions when fine-tuning a circuit would greatly benefit from even quicker reconfiguration times. To that end, the authors present an automated tool and runtime support to reconfigure LUT-based multiplexers and constants. In contrast to conventional multiplexers and constants, these components can be modified without direct communication with the static system.
Citations: 3
SimAcc: A Configurable Cycle-Accurate Simulator for Customized Accelerators on CPU-FPGAs SoCs
Konstantinos Iordanou, Oscar Palomar, John Mawer, Cosmin Gorgovan, A. Nisbet, M. Luján
This paper describes a flexible infrastructure for fast computer-architecture simulation and prototyping of accelerator IP. A trend for Systems-on-Chip is to include application-specific accelerators on the die. However, a key research problem still needs to be addressed: how do hardware accelerators interact with the processors of a system, and what is the impact on overall performance? To solve this problem, we propose an infrastructure that can directly simulate unmodified application executables with FPGA hardware accelerators. Unmodified application binaries are dynamically instrumented to generate processor load/store and program-counter events, as well as any memory accesses generated by accelerators, which are sent to an FPGA-based out-of-order pipeline model. The key features of our infrastructure are the ability to code exclusively at the user level, to dynamically discover and use available hardware models at run time, and to test and simultaneously optimize hardware accelerators in a heterogeneous system. For evaluation, we present a comparison between our system and gem5 to demonstrate accuracy and relative performance using the SPEC CPU benchmarks; even though our system is implemented on a Zynq XC7Z045, which integrates dual 667 MHz Arm Cortex-A9s with substantial FPGA resources, it outperforms gem5 running on a 3.2 GHz Xeon E3 with 32 GB of RAM. We also evaluate our infrastructure in simulating the interaction of accelerators with processors, using accelerators taken from the Mach Benchmark Suite and other custom accelerators from computer vision applications.
Citations: 2
EFCAD — An Embedded FPGA CAD Tool Flow for Enabling On-chip Self-Compilation
K. Pham, Malte Vesper, Dirk Koch, Eddie Hung
This paper combines a chain of academic tools to form an FPGA compilation flow for building partially reconfigurable modules on lightweight embedded platforms. Our flow, EFCAD, supports the entire stack from RTL (Verilog) to (partial) bitstream, and we demonstrate early results running on the on-chip ARM processor of, and targeting, the latest 16 nm generation Zynq UltraScale+ MPSoC device. With this, we complement Xilinx's PYNQ initiative not only to facilitate System-on-Chip research and education entirely within an embedded system, but also to allow building new, and specialising existing, custom-computing accelerators without needing access to a workstation.
Citations: 4
Exploiting Irregular Memory Parallelism in Quasi-Stencils through Nonlinear Transformation
Juan Escobedo, Mingjie Lin
Non-stencil kernels with irregular memory accesses pose unique challenges to achieving high computing performance and hardware efficiency in high-level synthesis (HLS) for FPGAs. We present a versatile and systematic approach to synthesizing a special and important subset of non-stencil computing kernels, quasi-stencils. These possess the mathematical property that, when studied in a high-dimensional space corresponding to prime factorization, the distance between the memory accesses of each kernel iteration becomes constant, so such an irregular non-stencil can be treated as a stencil. This opens the door to exploiting a vast array of existing memory optimization algorithms, such as memory partitioning/banking and data reuse, originally designed for standard stencil-based kernels, thereby offering a new opportunity to synthesize irregular non-stencil kernels effectively. We demonstrate the feasibility of our approach by implementing our methodology on a Xilinx KC705 FPGA board and testing it with several custom code segments that meet the quasi-stencil requirement, against some state-of-the-art memory-partitioning methods. We achieve a significant reduction in partition factor and, perhaps more importantly, make it proportional to the number of memory accesses rather than dependent on the problem size, at the cost of some wasted space.
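The "constant distance in prime-factorization space" property can be demonstrated concretely. The sketch below uses a toy prime basis and an access pattern of my own choosing, not anything taken from the paper: accesses such as a[2i], a[4i], a[8i] have growing distances in index space, but constant offsets once indices are mapped to their prime-exponent vectors.

```python
def prime_exponents(n, primes=(2, 3, 5)):
    """Map a positive integer to its exponent vector over a fixed prime
    basis (a toy basis; assumes n factors completely over `primes`)."""
    vec = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        vec.append(e)
    assert n == 1, "index must factor over the chosen prime basis"
    return tuple(vec)

def access_offsets(indices):
    """Pairwise offsets between consecutive accesses, measured in
    prime-exponent space rather than raw index space."""
    vecs = [prime_exponents(n) for n in indices]
    return [tuple(b - a for a, b in zip(vecs[j], vecs[j + 1]))
            for j in range(len(vecs) - 1)]

# a[2i], a[4i], a[8i]: distances 2i and 4i grow with i in index space,
# but in prime-exponent space the offset is the constant (1, 0, 0),
# so the pattern can be handled like a stencil there
for i in (1, 2, 4, 8):
    assert access_offsets((2 * i, 4 * i, 8 * i)) == [(1, 0, 0), (1, 0, 0)]
```

Once the offsets are constant, stencil-oriented memory partitioning and data-reuse analyses become applicable in the transformed space.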
Citations: 0
Monobit Wideband Receiver with Integrated Dithering in FPGA
Dan Pritsker, Colman Cheung
This work presents an innovative and competitive approach to re-purposing FPGA digital high-speed transceivers to sample wideband analog signals while achieving excellent sampling quality. Such a solution can achieve more than 16 GHz of instantaneous bandwidth using existing technology in the Stratix V FPGA family.
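The role of dithering in a monobit receiver can be illustrated with a toy numerical model. The sketch below is my own simplified model of 1-bit sampling, not the paper's transceiver design: adding uniform dither before the 1-bit comparator makes the expected output equal to the input amplitude, so averaging repeated dithered captures recovers the waveform that a bare sign detector destroys.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 5 * t)          # test tone, |x| <= 0.5

def monobit(sig, dither=0.0):
    """1-bit 'comparator': +1 if (sig + dither) >= 0, else -1."""
    return np.where(sig + dither >= 0, 1.0, -1.0)

plain = monobit(x)                           # sign only: amplitude is lost

# With dither uniform in [-1, 1], P(out = +1) = (1 + x) / 2, hence
# E[out] = x: averaging many dithered captures converges to the input.
reps = 2000
acc = np.zeros_like(x)
for _ in range(reps):
    acc += monobit(x, rng.uniform(-1, 1, size=x.shape))
recovered = acc / reps

err_plain = np.mean((plain - x) ** 2)        # large: sign(x) far from x
err_dithered = np.mean((recovered - x) ** 2) # small: dither averages out
```

The averaging here stands in for the oversampling and filtering a real receiver would perform; the point is only that dither linearizes the 1-bit quantizer.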
Citations: 1
FASE: FPGA Acceleration of Secure Function Evaluation
S. Hussain, F. Koushanfar
We present FASE, an FPGA accelerator for Secure Function Evaluation (SFE) employing the well-known cryptographic protocol named Yao's Garbled Circuit (GC). SFE allows two parties to jointly compute a function on their private data and learn the output without revealing their inputs to each other. FASE is designed to allow cloud servers to provide secure services to a large number of clients in parallel while preserving the privacy of the data on both sides. Current SFE accelerators either target specific applications, and are therefore not amenable to generic use, or have low throughput due to inefficient management of resources. In this work, we present a pipelined architecture along with an efficient scheduling scheme to ensure optimal usage of the available resources. The scheme is built around a simulator of the hardware design that schedules the workload and assigns the most suitable task to the encryption cores at each cycle. This, coupled with optimal management of the read and write cycles of the block RAM on the FPGA, yields at least a two-orders-of-magnitude improvement in per-core throughput on the reported benchmarks compared to the most recent generic GC accelerator. Moreover, our encryption core requires 17% fewer resources than the most recent secure GC realization.
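The cycle-by-cycle assignment of ready tasks to a fixed pool of cores can be sketched as a greedy list scheduler. This is a generic illustration of the scheduling idea under an assumed uniform gate latency, not FASE's actual simulator or its cost model.

```python
import heapq
from collections import defaultdict

def schedule(gates, deps, n_cores, latency=1):
    """Each cycle: retire finished gates, then issue up to `n_cores`
    ready gates (dependencies satisfied). Returns gate -> issue cycle."""
    indeg = {g: 0 for g in gates}
    succ = defaultdict(list)
    for g, inputs in deps.items():
        for p in inputs:
            indeg[g] += 1
            succ[p].append(g)
    ready = [g for g in gates if indeg[g] == 0]
    in_flight = []                 # min-heap of (finish_cycle, gate)
    issue_cycle, cycle = {}, 0
    while ready or in_flight:
        # retire everything whose latency has elapsed; wake successors
        while in_flight and in_flight[0][0] <= cycle:
            _, done = heapq.heappop(in_flight)
            for s in succ[done]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        # occupy the free cores with ready gates for this cycle
        for _ in range(min(n_cores, len(ready))):
            g = ready.pop()
            issue_cycle[g] = cycle
            heapq.heappush(in_flight, (cycle + latency, g))
        cycle += 1
    return issue_cycle
```

For a diamond of gates a -> {b, c} -> d, one core issues d at cycle 3, while two cores run b and c in the same cycle and issue d at cycle 2: the kind of resource-dependent makespan such a scheduling simulator explores.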
Citations: 16
π-BA: Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization
S. Qin, Qiang Liu, Bo Yu, Shaoshan Liu
Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, and street-view map generation. Essentially, BA is a joint non-linear optimization problem, one that can consume a significant amount of time and power, especially for large problems. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper, we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear in all images of a BA problem, we designed and implemented a co-observation optimization technique that accelerates BA operations with optimized usage of memory and computation resources. Experimental results confirm that π-BA outperforms existing software implementations in terms of performance and power consumption.
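The co-observation structure the abstract exploits can be made concrete with a toy example (hypothetical observation lists, not data from the paper): because each 3-D point is seen by only a few cameras, two cameras are coupled in the BA normal equations only if they co-observe at least one point.

```python
import numpy as np

# hypothetical map: point id -> cameras that observe it
observations = {0: [0, 1], 1: [1, 2], 2: [0, 1, 2], 3: [2, 3]}
n_cams = 4

# cameras i and j interact in the camera block of the normal equations
# only when some point is observed by both
coupling = np.zeros((n_cams, n_cams), dtype=bool)
for cams in observations.values():
    for i in cams:
        for j in cams:
            coupling[i, j] = True

# cameras 0 and 3 share no point, so their block stays empty; a
# co-observation-aware accelerator never needs to compute or store it
```

Skipping the empty blocks is what lets an accelerator trim both memory traffic and arithmetic relative to a dense formulation.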
Citations: 12
Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs
E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu
Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with a more balanced on-chip memory and compute, can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and to provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× lower latency than the GPU (FP32) and 34× higher energy efficiency. It has 2× the aggregate on-chip memory capacity of a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.
{"title":"Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs","authors":"E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu","doi":"10.1109/FCCM.2019.00035","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00035","url":null,"abstract":"Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA with a more balanced on-chip memory and compute can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as system-in-package to enhance on-chip memory capacity and bandwidth, and provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× lower latency than GPU (FP32) and 34× higher energy efficiency. It has 2× aggregate on-chip memory capacity compared to a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123755994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
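The capacity arithmetic behind the persistent approach is easy to reproduce. Below is a back-of-envelope sketch assuming a standard LSTM weight layout (four gate matrices acting on the concatenated input and hidden state, so 8·hidden² weights per layer when the input width equals the hidden width); the function and parameter names are illustrative assumptions, not figures or code from the paper:

```python
def persistent_fit(hidden, layers, bytes_per_weight, on_chip_mb):
    """Estimate an LSTM stack's weight footprint in MiB and check
    whether it fits in on-chip memory -- the precondition for the
    persistent approach, which avoids off-chip weight fetches.

    Assumes input width == hidden width, so each layer holds four
    (hidden x 2*hidden) gate matrices: 8 * hidden**2 weights.
    """
    weights = 8 * hidden * hidden * layers
    size_mb = weights * bytes_per_weight / 2**20
    return size_mb, size_mb <= on_chip_mb

# Example: a 4-layer, 1024-wide LSTM at INT8 (1 byte/weight)
# needs 32 MiB, which would fit in a 64 MB on-chip budget like
# the TensorRAM chiplet described above.
size_mb, fits = persistent_fit(hidden=1024, layers=4,
                               bytes_per_weight=1, on_chip_mb=64)
# -> size_mb == 32.0, fits == True
```

The same model at FP32 quadruples to 128 MiB and no longer fits, which is one reason the precision comparison (FP32 vs. INT8) matters so much in the abstract's latency results.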