2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications
An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs
Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, Yun Liang
Deep convolutional neural networks (CNNs) have achieved remarkable performance at the cost of huge computation. As CNN models become more complex and deeper, compressing CNNs to sparse form by pruning redundant connections has emerged as an attractive approach to reduce computation and memory requirements. In recent years, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. However, most existing FPGA architectures focus on dense CNN models. Architectures designed for dense CNN models are inefficient when executing sparse models, as most of the arithmetic operations involve addition and multiplication with zero operands. On the other hand, recent sparse FPGA accelerators focus only on FC layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently deal with the irregular connections in the sparse convolutional layer, we propose a weight-oriented dataflow that processes each weight individually. We then design an FPGA architecture that can handle input-weight and weight-output connections efficiently. For input-weight connections, we design a tile look-up table to eliminate the runtime index matching of compressed weights. Moreover, we develop a weight layout to enable high on-chip memory access. To cooperate with the weight layout, a channel multiplexer is inserted to locate the address, ensuring no data-access conflicts. Experiments demonstrate that our accelerator achieves 223.4-309.0 GOP/s for modern CNNs on a Xilinx ZCU102, a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.
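The weight-oriented dataflow can be illustrated in software. Below is a minimal NumPy sketch, assuming a hypothetical `(value, c_in, c_out, r, s)` coordinate list for the non-zero weights; it shows only the dataflow idea, not the paper's FPGA architecture, tile look-up table, or weight layout.

```python
import numpy as np

def sparse_conv2d(x, nz_weights, out_shape):
    """Weight-oriented sparse convolution: visit each non-zero weight
    individually and accumulate its contribution to the whole output
    channel ('valid' convolution, stride 1).

    x: input activations, shape (C_in, H, W)
    nz_weights: list of (value, c_in, c_out, r, s) non-zero entries
    out_shape: (C_out, H_out, W_out)
    """
    y = np.zeros(out_shape, dtype=x.dtype)
    _, h_out, w_out = out_shape
    for v, c_in, c_out, r, s in nz_weights:
        # one weight multiplies an entire shifted window of its input channel
        y[c_out] += v * x[c_in, r:r + h_out, s:s + w_out]
    return y
```

Because zero weights never appear in `nz_weights`, no multiply-accumulate with a zero operand is ever issued, which is exactly the waste the abstract attributes to dense architectures running sparse models.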
Citations: 88
A High Throughput and Energy-Efficient Retina-Inspired Tone Mapping Processor
Lili Liu, Xiaoqiang Xiang, Yuxiang Xie, Yongjie Li, Bo Yan, Jun Zhou
This paper presents a high-throughput and energy-efficient retina-inspired tone mapping processor. Several hardware design techniques are proposed to achieve high throughput and energy efficiency, including data-partition-based parallel processing with S-shape sliding, adjacent-frame feature sharing, multi-layer convolution pipelining, and convolution filter compression with zero-skipping convolution. The proposed processor has been implemented on a Xilinx Virtex-7 FPGA for demonstration. It achieves a throughput of 189 frames per second for 1024*768 RGB images at 819 mW. Compared with several state-of-the-art tone mapping processors, the proposed processor achieves higher throughput and energy efficiency. It is suitable for high-speed, energy-constrained video enhancement applications such as autonomous vehicles and drone monitoring.
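Zero-skipping convolution, one of the techniques listed above, can be sketched in a few lines. This is an illustrative software model only (the paper applies the idea in hardware together with filter compression); the 1-D correlation form and names are my own.

```python
import numpy as np

def conv1d_zero_skipping(x, w):
    """Multiply-accumulate only for the non-zero filter taps, skipping
    zero coefficients entirely (sliding correlation, 'valid' range)."""
    nz = [(i, wi) for i, wi in enumerate(w) if wi != 0]  # compressed filter
    n_out = len(x) - len(w) + 1
    y = np.zeros(n_out)
    for j in range(n_out):
        for i, wi in nz:  # zero taps cost nothing here
            y[j] += wi * x[j + i]
    return y
```

With a filter that is half zeros, the inner loop issues half the MACs of the dense version while producing identical results; the output matches `np.correlate(x, w, mode="valid")`.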
Citations: 1
Automated Tool and Runtime Support for Fine-Grain Reconfiguration in Highly Flexible Reconfigurable Systems
Rafael Zamacola, A. García-Martínez, J. Mora, A. Otero, E. D. L. Torre
Dynamic partial reconfiguration significantly reduces reconfiguration times when offloading a partial design. However, there are occasions when fine-tuning a circuit would greatly benefit from even quicker reconfiguration times. To that end, the authors present an automated tool and runtime support to reconfigure LUT-based multiplexers and constants. In contrast to conventional multiplexers and constants, these components can be modified without direct communication with the static system.
Citations: 3
SimAcc: A Configurable Cycle-Accurate Simulator for Customized Accelerators on CPU-FPGAs SoCs
Konstantinos Iordanou, Oscar Palomar, John Mawer, Cosmin Gorgovan, A. Nisbet, M. Luján
This paper describes a flexible infrastructure for fast computer-architecture simulation and prototyping of accelerator IP. A trend for Systems-on-Chip is to include application-specific accelerators on the die. However, a key research problem still needs to be addressed: how do hardware accelerators interact with the processors of a system, and what is the impact on overall performance? To solve this problem, we propose an infrastructure that can directly simulate unmodified application executables with FPGA hardware accelerators. Unmodified application binaries are dynamically instrumented to generate processor load/store and program-counter events, as well as any memory accesses generated by accelerators, which are sent to an FPGA-based out-of-order pipeline model. The key features of our infrastructure are the ability to code exclusively at the user level, to dynamically discover and use available hardware models at run time, and to test and simultaneously optimize hardware accelerators in a heterogeneous system. For evaluation, we present a comparison between our system and gem5 to demonstrate accuracy and relative performance using the SPEC CPU benchmarks; even though our system is implemented on a Zynq XC7Z045, which integrates dual 667 MHz Arm Cortex-A9s with substantial FPGA resources, it outperforms gem5 running on a 3.2 GHz Xeon E3 with 32 GB of RAM. We also evaluate our infrastructure in simulating the interaction of accelerators with processors, using accelerators taken from the Mach Benchmark Suite and other custom accelerators from computer vision applications.
Citations: 2
EFCAD — An Embedded FPGA CAD Tool Flow for Enabling On-chip Self-Compilation
K. Pham, Malte Vesper, Dirk Koch, Eddie Hung
This paper combines a chain of academic tools to form an FPGA compilation flow for building partially reconfigurable modules on lightweight embedded platforms. Our flow, EFCAD, supports the entire stack from RTL (Verilog) to (partial) bitstream, and we demonstrate early results running on the on-chip ARM processor of, and targeting, the latest 16 nm generation Zynq UltraScale+ MPSoC device. With this, we complement Xilinx's PYNQ initiative not only to facilitate System-on-Chip research and education entirely within an embedded system, but also to allow building new, and specialising existing, custom-computing accelerators without needing access to a workstation.
Citations: 4
Exploiting Irregular Memory Parallelism in Quasi-Stencils through Nonlinear Transformation
Juan Escobedo, Mingjie Lin
Non-stencil kernels with irregular memory accesses pose unique challenges to achieving high computing performance and hardware efficiency in high-level synthesis (HLS) for FPGAs. We present a versatile and systematic approach to synthesizing a special and important subset of non-stencil computing kernels, quasi-stencils. These possess the mathematical property that, when studied in a high-dimensional space corresponding to prime factorization, the distance between the memory accesses of each kernel iteration becomes constant, so such an irregular non-stencil can be treated as a stencil. This opens the door to exploiting a vast array of existing memory optimization algorithms, such as memory partitioning/banking and data reuse, originally designed for standard stencil-based kernels, thereby offering a new opportunity to synthesize irregular non-stencil kernels effectively. We demonstrate the feasibility of our approach by implementing our methodology on a Xilinx KC705 FPGA board and testing it with several custom code segments that meet the quasi-stencil requirement, against some state-of-the-art memory-partitioning methods. We achieve a significant reduction in partition factor and, perhaps more importantly, make it proportional to the number of memory accesses rather than dependent on the problem size, at the cost of some wasted space.
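The "constant distance in prime-factorization space" property can be demonstrated concretely. The sketch below uses a toy prime basis and an access pattern of my own choosing, not anything taken from the paper: accesses such as a[2i], a[4i], a[8i] have growing distances in index space, but constant offsets once indices are mapped to their prime-exponent vectors.

```python
def prime_exponents(n, primes=(2, 3, 5)):
    """Map a positive integer to its exponent vector over a fixed prime
    basis (a toy basis; assumes n factors completely over `primes`)."""
    vec = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        vec.append(e)
    assert n == 1, "index must factor over the chosen prime basis"
    return tuple(vec)

def access_offsets(indices):
    """Pairwise offsets between consecutive accesses, measured in
    prime-exponent space rather than raw index space."""
    vecs = [prime_exponents(n) for n in indices]
    return [tuple(b - a for a, b in zip(vecs[j], vecs[j + 1]))
            for j in range(len(vecs) - 1)]

# a[2i], a[4i], a[8i]: distances 2i and 4i grow with i in index space,
# but in prime-exponent space the offset is the constant (1, 0, 0),
# so the pattern can be handled like a stencil there
for i in (1, 2, 4, 8):
    assert access_offsets((2 * i, 4 * i, 8 * i)) == [(1, 0, 0), (1, 0, 0)]
```

Once the offsets are constant, stencil-oriented memory partitioning and data-reuse analyses become applicable in the transformed space.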
Citations: 0
Monobit Wideband Receiver with Integrated Dithering in FPGA
Dan Pritsker, Colman Cheung
This work presents an innovative and competitive approach to re-purposing FPGA digital high-speed transceivers to sample wideband analog signals while achieving excellent sampling quality. Such a solution can achieve more than 16 GHz of instantaneous bandwidth using existing technology in the Stratix V FPGA family.
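The role of dithering in a monobit receiver can be illustrated with a toy numerical model. The sketch below is my own simplified model of 1-bit sampling, not the paper's transceiver design: adding uniform dither before the 1-bit comparator makes the expected output equal to the input amplitude, so averaging repeated dithered captures recovers the waveform that a bare sign detector destroys.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 5 * t)          # test tone, |x| <= 0.5

def monobit(sig, dither=0.0):
    """1-bit 'comparator': +1 if (sig + dither) >= 0, else -1."""
    return np.where(sig + dither >= 0, 1.0, -1.0)

plain = monobit(x)                           # sign only: amplitude is lost

# With dither uniform in [-1, 1], P(out = +1) = (1 + x) / 2, hence
# E[out] = x: averaging many dithered captures converges to the input.
reps = 2000
acc = np.zeros_like(x)
for _ in range(reps):
    acc += monobit(x, rng.uniform(-1, 1, size=x.shape))
recovered = acc / reps

err_plain = np.mean((plain - x) ** 2)        # large: sign(x) far from x
err_dithered = np.mean((recovered - x) ** 2) # small: dither averages out
```

The averaging here stands in for the oversampling and filtering a real receiver would perform; the point is only that dither linearizes the 1-bit quantizer.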
Citations: 1
FASE: FPGA Acceleration of Secure Function Evaluation
S. Hussain, F. Koushanfar
We present FASE, an FPGA accelerator for Secure Function Evaluation (SFE) employing the well-known cryptographic protocol named Yao's Garbled Circuit (GC). SFE allows two parties to jointly compute a function on their private data and learn the output without revealing their inputs to each other. FASE is designed to allow cloud servers to provide secure services to a large number of clients in parallel while preserving the privacy of the data on both sides. Current SFE accelerators either target specific applications, and are therefore not amenable to generic use, or have low throughput due to inefficient management of resources. In this work, we present a pipelined architecture along with an efficient scheduling scheme to ensure optimal usage of the available resources. The scheme is built around a simulator of the hardware design that schedules the workload and assigns the most suitable task to the encryption cores at each cycle. This, coupled with optimal management of the read and write cycles of the block RAM on the FPGA, yields at least a two-orders-of-magnitude improvement in per-core throughput on the reported benchmarks compared to the most recent generic GC accelerator. Moreover, our encryption core requires 17% fewer resources than the most recent secure GC realization.
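The cycle-by-cycle assignment of ready tasks to a fixed pool of cores can be sketched as a greedy list scheduler. This is a generic illustration of the scheduling idea under an assumed uniform gate latency, not FASE's actual simulator or its cost model.

```python
import heapq
from collections import defaultdict

def schedule(gates, deps, n_cores, latency=1):
    """Each cycle: retire finished gates, then issue up to `n_cores`
    ready gates (dependencies satisfied). Returns gate -> issue cycle."""
    indeg = {g: 0 for g in gates}
    succ = defaultdict(list)
    for g, inputs in deps.items():
        for p in inputs:
            indeg[g] += 1
            succ[p].append(g)
    ready = [g for g in gates if indeg[g] == 0]
    in_flight = []                 # min-heap of (finish_cycle, gate)
    issue_cycle, cycle = {}, 0
    while ready or in_flight:
        # retire everything whose latency has elapsed; wake successors
        while in_flight and in_flight[0][0] <= cycle:
            _, done = heapq.heappop(in_flight)
            for s in succ[done]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        # occupy the free cores with ready gates for this cycle
        for _ in range(min(n_cores, len(ready))):
            g = ready.pop()
            issue_cycle[g] = cycle
            heapq.heappush(in_flight, (cycle + latency, g))
        cycle += 1
    return issue_cycle
```

For a diamond of gates a -> {b, c} -> d, one core issues d at cycle 3, while two cores run b and c in the same cycle and issue d at cycle 2: the kind of resource-dependent makespan such a scheduling simulator explores.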
Citations: 16
π-BA: Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization
S. Qin, Qiang Liu, Bo Yu, Shaoshan Liu
Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, and street-view map generation. Essentially, BA is a joint non-linear optimization problem, one that can consume a significant amount of time and power, especially for large problems. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper, we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear in all images of a BA problem, we designed and implemented a co-observation optimization technique that accelerates BA operations with optimized usage of memory and computation resources. Experimental results confirm that π-BA outperforms existing software implementations in terms of performance and power consumption.
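The co-observation structure the abstract exploits can be made concrete with a toy example (hypothetical observation lists, not data from the paper): because each 3-D point is seen by only a few cameras, two cameras are coupled in the BA normal equations only if they co-observe at least one point.

```python
import numpy as np

# hypothetical map: point id -> cameras that observe it
observations = {0: [0, 1], 1: [1, 2], 2: [0, 1, 2], 3: [2, 3]}
n_cams = 4

# cameras i and j interact in the camera block of the normal equations
# only when some point is observed by both
coupling = np.zeros((n_cams, n_cams), dtype=bool)
for cams in observations.values():
    for i in cams:
        for j in cams:
            coupling[i, j] = True

# cameras 0 and 3 share no point, so their block stays empty; a
# co-observation-aware accelerator never needs to compute or store it
```

Skipping the empty blocks is what lets an accelerator trim both memory traffic and arithmetic relative to a dense formulation.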
Citations: 12
Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs
E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu
Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with a more balanced on-chip memory and compute, can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and to provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× lower latency than the GPU (FP32) and 34× higher energy efficiency. It has 2× the aggregate on-chip memory capacity of a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.
{"title":"Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs","authors":"E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu","doi":"10.1109/FCCM.2019.00035","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00035","url":null,"abstract":"Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA with a more balanced on-chip memory and compute can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as system-in-package to enhance on-chip memory capacity and bandwidth, and provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× lower latency than GPU (FP32) and 34× higher energy efficiency. It has 2× aggregate on-chip memory capacity compared to a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123755994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
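The capacity arithmetic behind the persistent approach is easy to reproduce. Below is a back-of-envelope sketch assuming a standard LSTM weight layout (four gate matrices acting on the concatenated input and hidden state, so 8·hidden² weights per layer when the input width equals the hidden width); the function and parameter names are illustrative assumptions, not figures or code from the paper:

```python
def persistent_fit(hidden, layers, bytes_per_weight, on_chip_mb):
    """Estimate an LSTM stack's weight footprint in MiB and check
    whether it fits in on-chip memory -- the precondition for the
    persistent approach, which avoids off-chip weight fetches.

    Assumes input width == hidden width, so each layer holds four
    (hidden x 2*hidden) gate matrices: 8 * hidden**2 weights.
    """
    weights = 8 * hidden * hidden * layers
    size_mb = weights * bytes_per_weight / 2**20
    return size_mb, size_mb <= on_chip_mb

# Example: a 4-layer, 1024-wide LSTM at INT8 (1 byte/weight)
# needs 32 MiB, which would fit in a 64 MB on-chip budget like
# the TensorRAM chiplet described above.
size_mb, fits = persistent_fit(hidden=1024, layers=4,
                               bytes_per_weight=1, on_chip_mb=64)
# -> size_mb == 32.0, fits == True
```

The same model at FP32 quadruples to 128 MiB and no longer fits, which is one reason the precision comparison (FP32 vs. INT8) matters so much in the abstract's latency results.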