Parallel-Pipeline Fast Walsh-Hadamard Transform Implementation Using HLS
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609874
A. M. García, C. O. Quero, J. Rangel-Magdaleno, J. Martínez-Carranza, D. D. Romero
The Walsh-Hadamard Transform (WHT) is an orthogonal, symmetric, involutional, and linear operation used in data encryption, data compression, and quantum computing. The WHT belongs to a generalized class of Fourier transforms, which means that many algorithms developed for the fast Fourier transform (FFT) also work for fast WHT implementations (FWHT). This paper exploits this property and uses a well-known parallel-pipeline FFT strategy for VLSI implementation to build parallel-pipeline FWHT architectures. We apply the FFT parallel-pipeline approach to a fast WHT and use the High-Level Synthesis (HLS) tool from Xilinx Vitis to generate an FPGA solution. We also provide open-source code with the basic blocks needed to build a model with any parallelization level. The proposed parallel-pipeline solutions achieve a latency reduction of up to 3.57% compared to a pipelined approach on a 256-point signal using 32-bit floating-point numbers.
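As a hedged illustration of the structural similarity to the FFT, the sketch below implements an iterative radix-2 fast WHT in Python: it has the same butterfly dataflow as a radix-2 FFT, but every "twiddle factor" is +1 or -1, which is why FFT parallel-pipeline architectures carry over. The function name and the 256-point example are illustrative, not taken from the paper's HLS code.

```python
import numpy as np

def fwht(x):
    """Iterative radix-2 fast Walsh-Hadamard transform (natural order).

    Same butterfly structure as a radix-2 FFT, but every 'twiddle factor'
    is +1 or -1, so only additions and subtractions are needed.
    """
    x = np.asarray(x, dtype=np.float32).copy()
    n = x.size
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b   # butterfly: sum and difference
        h *= 2
    return x

# 256-point example with 32-bit floats, matching the signal length in the paper.
signal = np.random.rand(256).astype(np.float32)
spectrum = fwht(signal)
# The WHT is involutional up to a scale factor of n: fwht(fwht(x)) == n * x.
# Tolerances are loose because of float32 rounding across the 8 stages.
assert np.allclose(fwht(spectrum), 256 * signal, rtol=1e-3, atol=1e-2)
```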
{"title":"Parallel-Pipeline Fast Walsh-Hadamard Transform Implementation Using HLS","authors":"A. M. García, C. O. Quero, J. Rangel-Magdaleno, J. Martínez-Carranza, D. D. Romero","doi":"10.1109/ICFPT52863.2021.9609874","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609874","url":null,"abstract":"Walsh Hadamard Transform (WHT) is an orthogonal, symmetric, involutional, and linear operation used in data encryption, data compression, and quantum computing. The WHT belongs to a generalized class of Fourier transforms, which allows that many algorithms developed for the fast Fourier transform (FFT) work for fast WHT implementations (FWHT). This paper employs this property and uses a parallel-pipeline FFT well-known strategy for VLSI implementation to build parallel-pipeline architectures for FWHT. We apply the FFT parallel-pipeline approach on a Fast WHT and use the High-Level Synthesis (HLS) tool from Xilinx Vitis to generate an FPGA solution. We also provide an open-source code with the basic blocks to build any model with any parallelization level. The parallel-pipeline proposed solutions achieve a latency reduction of up to 3.57% compared to a pipeline approach on a 256-long signal using 32 bit floating-point numbers.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127612743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StateLink: FPGA System Debugging via Flexible Simulation/Hardware Integration
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609846
Sameh Attia, Vaughn Betz
Checkpoint-based debugging flows that allow moving the design state between an FPGA and a simulator have recently emerged. These flows combine the speed of hardware execution with the full observability and controllability of HDL simulation. However, they assume the entire system state can be moved to a simulator, limiting them to self-contained systems and precluding their use in network- or CPU-attached FPGAs. In this paper, we present StateLink, a co-simulation framework that allows a design-under-test (DUT) running in a simulator to interact with other design elements that reside in hardware. StateLink creates links between DUT interfaces in the HDL simulation and their equivalents in hardware, thereby allowing the DUT to remain connected to and active in the overall hardware system after its state is moved to a simulator. This extends the functionality of checkpoint-based debugging frameworks to designs with external I/Os such as DRAM and Ethernet, and to designs that contain components with no simulation models. It also significantly decreases the simulation time of DUTs that are part of a large system. For example, it speeds up the HDL simulation of designs that interface with DRAM by up to 25×. Incorporating StateLink in a design typically adds no timing overhead and only a modest hardware area overhead; for example, StateLink adds 916 LUTs to a 32-bit AXI memory-mapped interface and 1423 LUTs to a 32-bit AXI streaming interface.
{"title":"StateLink: FPGA System Debugging via Flexible Simulation/Hardware Integration","authors":"Sameh Attia, Vaughn Betz","doi":"10.1109/ICFPT52863.2021.9609846","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609846","url":null,"abstract":"Checkpoint-based debugging flows that allow moving the design state between an FPGA and a simulator have recently emerged. These flows combine the speed of hardware execution and the full observability and controllability of HDL simulation. However, they assume the entire system state can be moved to a simulator, limiting them to self-contained systems and precluding their use in network or CPU-attached FPGAs. In this paper, we present StateLink, a co-simulation framework that allows a design-under-test (DUT) running in a simulator to interact with other design elements that reside in hardware. StateLink creates links between DUT interfaces in the HDL simulation and their equivalents in hardware, thereby allowing the DUT to remain connected to and active in the overall hardware system after its state is moved to a simulator. This extends the functionality of checkpoint-based debugging frameworks to designs with external I/Os such as DRAM and Ethernet, and to designs that contain components with no simulation models. It also significantly decreases the simulation time of DUTs that are part of a large system. For example, it speeds up the HDL simulation of designs that interface with DRAM by up to 25 ×. Incorporating StateLink in a design typically adds no timing overhead and a modest hardware area overhead; for example, StateLink adds 916 LUTs to a 32-bit AXI memory-mapped and 1423 LUTs to a 32-bit AXI streaming interface.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131400079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterization of IOBUF-based Ring Oscillators
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609950
J. Burgiel, Daniel E. Holcomb, Ilias Giechaskiel, Shanquan Tian, Jakub Szefer
Ring oscillators (ROs) are fundamental primitives that are used as building blocks in many other types of circuits. This paper presents an in-depth characterization of ring oscillators that leverage the IOBUF primitive found in modern Xilinx FPGAs. This work first analyzes the impact of the drive strength and slew rate attributes of the IOBUFs on the ROs, and then characterizes the impact of external temperature, internal voltage, and external voltage fluctuations on the frequency of the proposed ROs. This work further demonstrates that IOBUF-based ROs can detect whether the electrical connections to the IOBUF pins have changed, including whether a DRAM module has been physically removed. Finally, the proposed ROs can be realized on cloud FPGAs, bypassing the restrictions that some cloud providers impose on combinatorial loops, and thus presenting a new security threat to remote FPGAs.
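A minimal sketch of the measurement idea follows, assuming the hardware exposes a free-running counter clocked by the RO that is sampled over a fixed gate interval; the function names, gate time, and detection threshold are illustrative assumptions, not the paper's interface.

```python
def ro_frequency_hz(counter_delta, gate_time_s):
    """Estimate RO frequency from the number of oscillations counted
    during a fixed measurement window."""
    return counter_delta / gate_time_s

def connection_changed(baseline_hz, measured_hz, threshold=0.05):
    """Flag a change in the electrical load on the IOBUF pin (for example,
    a DRAM module being removed) when the RO frequency shifts by more than
    a relative threshold from its calibrated baseline."""
    return abs(measured_hz - baseline_hz) / baseline_hz > threshold

# Example: a counter that advanced by 1_500_000 during a 10 ms gate
# corresponds to a 150 MHz ring oscillator; against a 140 MHz baseline,
# the ~7% shift exceeds the 5% threshold and is flagged.
f = ro_frequency_hz(1_500_000, 10e-3)
print(f, connection_changed(140e6, f))
```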
{"title":"Characterization of IOBUF-based Ring Oscillators","authors":"J. Burgiel, Daniel E. Holcomb, Ilias Giechaskiel, Shanquan Tian, Jakub Szefer","doi":"10.1109/ICFPT52863.2021.9609950","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609950","url":null,"abstract":"Ring Oscillators (ROs) are fundamental primitives that are used as building blocks in many other types of circuits. This paper presents an in-depth characterization of ring oscillators which leverage the IOBUF primitive found in modern Xilinx FPGAs. This work first analyzes the impact of the drive strength and slew rate attributes of the IOBUFs on the ROs, and also characterizes the impacts of external temperature, internal voltage, and external voltage fluctuations on the frequency of the proposed ROs. This work further demonstrates that IOBUF-based ROs can detect whether electrical connections to the IOBUF pins have changed, including whether the DRAM module has been physically removed. Finally, the proposed ROs can be realized on cloud FPGAs, bypassing the restrictions that some cloud providers impose on combinatorial loops, and thus presenting a new security threat to remote FPGAs.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113970725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamZip: Compressed Sliding-Windows for Stream Aggregation
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609952
Prajith Ramakrishnan Geethakumari, I. Sourdis
High-performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window before processing in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck because they put tremendous pressure on memory bandwidth and capacity. This paper addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that compresses its sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor into the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms, offering both lossless and lossy compression for integers as well as floating-point numbers. Compared to designs without compression, the StreamZip lossless and lossy designs achieve up to 7× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
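The abstract does not give StreamZip's compression formats, so the sketch below only illustrates the general idea under a simple assumption: integer values entering a count-based sliding window are stored delta-encoded (a basic lossless scheme) and decoded on the fly when the window is read to feed an aggregation function. Names such as CompressedSlidingWindow are hypothetical.

```python
from collections import deque

class CompressedSlidingWindow:
    """Count-based sliding window that stores integer values delta-encoded.

    Deltas between consecutive values are usually small, so they compress
    well; the window is decoded only when an aggregation is requested.
    """
    def __init__(self, size):
        self.size = size
        self.first = None          # oldest value, stored in full
        self.deltas = deque()      # deltas between consecutive stored values
        self.last = None           # most recent value, for O(1) appends

    def insert(self, value):
        if self.first is None:
            self.first = self.last = value
            return
        self.deltas.append(value - self.last)
        self.last = value
        if 1 + len(self.deltas) > self.size:      # evict the oldest value
            self.first += self.deltas.popleft()

    def values(self):
        out, acc = [self.first], self.first
        for d in self.deltas:
            acc += d
            out.append(acc)
        return out

    def aggregate(self, fn):
        return fn(self.values())

w = CompressedSlidingWindow(size=4)
for v in [10, 12, 11, 15, 18, 20]:
    w.insert(v)
print(w.values(), w.aggregate(max))   # [11, 15, 18, 20] 20
```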
{"title":"StreamZip: Compressed Sliding-Windows for Stream Aggregation","authors":"Prajith Ramakrishnan Geethakumari, I. Sourdis","doi":"10.1109/ICFPT52863.2021.9609952","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609952","url":null,"abstract":"High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding-window before processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This paper addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding-windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113977384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A dataset generation for object recognition and a tool for generating ROS2 FPGA node
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609880
Hayato Amano, Hayato Mori, Akinobu Mizutani, Tomohiro Ono, Yuma Yoshimoto, Takeshi Ohkawa, H. Tamukoh
This paper introduces our autonomous driving system, which is equipped with camera-image recognition processing units for hazardous object / human-doll detection and driving-lane detection. In particular, this paper focuses on a dataset generation method for neural networks and a generation tool, the "FPGA Oriented Easy Synthesizer Tool (FOrEST)", for ROS2-FPGA nodes. The results show that the mAP of a neural network trained on the generated dataset is 94%, and the ROS2-FPGA communication overhead introduced by FOrEST is 2–3 ms.
{"title":"A dataset generation for object recognition and a tool for generating ROS2 FPGA node","authors":"Hayato Amano, Hayato Mori, Akinobu Mizutani, Tomohiro Ono, Yuma Yoshimoto, Takeshi Ohkawa, H. Tamukoh","doi":"10.1109/ICFPT52863.2021.9609880","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609880","url":null,"abstract":"This paper introduces our autonomous driving system equipped with recognition processing units from a camera image for hazard object / human-doll detection and drive lane detection. In particular, this paper focuses on a dataset generation method for neural networks and a generation tool “FPGA Oriented Easy Synthesizer Tool (FOrEST)” for ROS2-FPGA nodes. The results show that mAP of a neural network trained by the generated dataset is 94%, and a overhead of ROS2-FPGA communication by the FOrEST is 2–3 ms.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134160428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A High-Precision Flexible Symmetry-Aware Architecture for Element-Wise Activation Functions
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609865
Xuan Feng, Yue Li, Yu Qian, Jingbo Gao, Wei Cao, Lingli Wang
Nonlinear activation functions (NAFs) play an essential role in deep neural networks (DNNs). Since versatile DNN accelerators need to support various DNNs that contain different NAFs, flexible hardware designs supporting those NAFs have become crucial. However, there are few high-precision flexible hardware architectures, and the symmetries of different NAFs have not been fully studied. This paper proposes a high-precision symmetry-aware architecture based on piecewise linear approximation. Through a reconfigurable data path, the architecture can support various typical NAFs. An efficient non-uniform segmentation scheme is proposed to achieve high precision for each NAF. Besides, exploiting the unified symmetry of NAFs saves half the memory. To reduce the computational cost, a 25×18 DSP is shared by two INT 7×9 multipliers with two independent inputs. The architecture is implemented on a Xilinx ZC706 at a frequency of 410 MHz. Compared with the state-of-the-art flexible nonlinear core, our flexible architecture costs fewer hardware resources while achieving higher precision. Applying the design to BERT-BASE, MobileNetV3, and EfficientNet-B3 on the PyTorch platform, experimental results show that the accuracy loss is 0 for BERT-BASE and 0.002% for EfficientNet-B3, while for MobileNetV3 the accuracy is even improved by 0.01%.
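As a hedged, software-level sketch of the two ideas the abstract highlights, the code below approximates tanh with piecewise-linear segments over non-uniform breakpoints (denser where curvature is high) and stores coefficients only for x >= 0, reconstructing negative inputs from the odd symmetry tanh(-x) = -tanh(x). The breakpoint positions and segment count are illustrative, not the paper's segmentation scheme.

```python
import numpy as np

# Non-uniform breakpoints for x >= 0: denser near 0 where tanh curves most.
BREAKS = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0])
# Per-segment slope/intercept of the chord through each segment's end points.
Y = np.tanh(BREAKS)
SLOPES = np.diff(Y) / np.diff(BREAKS)
INTERCEPTS = Y[:-1] - SLOPES * BREAKS[:-1]

def pwl_tanh(x):
    """Piecewise-linear tanh using only the x >= 0 half of the table."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    ax = np.abs(x)                       # odd symmetry: evaluate on |x|
    seg = np.clip(np.searchsorted(BREAKS, ax, side="right") - 1,
                  0, len(SLOPES) - 1)
    y = SLOPES[seg] * ax + INTERCEPTS[seg]
    y = np.where(ax >= BREAKS[-1], np.tanh(BREAKS[-1]), y)   # saturate
    return sign * y

xs = np.linspace(-5, 5, 1001)
print(np.max(np.abs(pwl_tanh(xs) - np.tanh(xs))))   # max absolute error
```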
{"title":"A High-Precision Flexible Symmetry-Aware Architecture for Element-Wise Activation Functions","authors":"Xuan Feng, Yue Li, Yu Qian, Jingbo Gao, Wei Cao, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609865","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609865","url":null,"abstract":"Nonlinear activation functions (NAFs) play an essential role in deep neural networks (DNNs). Since versatile DNN accelerators need to support various DNNs which contain different NAFs, the flexible hardware design supporting those NAFs has become crucial. However, there are few high-precision flexible hardware architectures, and the symmetries of different NAFs have not been fully studied. This paper proposes a high-precision symmetry-aware architecture based on piecewise linear approximation. Through the reconfigurable data path, the architecture can support various typical NAFs. The efficient non-uniform segmentation scheme is proposed to achieve high precision for each NAF. Besides, the utilization of unified symmetry for NAFs can save half the memory. To reduce the computational cost, a 25×18 DSP is shared by two INT 7×9 multipliers with two independent inputs. The architecture is implemented on Xilinx ZC706 at a frequency of 410MHz. Compared with the state-of-the-art flexible nonlinear core, our flexible architecture costs fewer hardware resources with higher precision. Applying the design to BERT-BASE, MobileNetV3, and EfficientNet-B3 on the PyTorch platform, experimental results show that the accuracy loss is either 0 for BERT-BASE, or 0.002% for EfficientNet-B3. For MobileNetV3, the accuracy is even improved by 0.01%.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123892828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609944
Kazuki Furukawa, Ryohei Kobayashi, Tomoya Yokono, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, M. Umemura
This paper proposes an efficient buffering approach for implementing radiative transfer equations that bridges the performance gap between processing elements and HBM memory bandwidth. The radiative transfer equation originates from a fundamental physical process in astrophysics and has attracted much attention in recent years because of a wealth of applications such as medical bioimaging. However, accelerating it requires a complicated, low-latency memory access pattern, and earlier studies reveal that conventional, software-controlled memory access is ill-suited to this computation. This article therefore targets an HBM-equipped FPGA and proposes an application-specific buffering mechanism called PRISM (PRefetchable and Instantly accessible Scratchpad Memory) to efficiently bridge the computational units and the HBM. The proposed approach is evaluated on a Xilinx Alveo U280 FPGA, and the experimental results are discussed.
{"title":"An efficient RTL buffering scheme for an FPGA-accelerated simulation of diffuse radiative transfer","authors":"Kazuki Furukawa, Ryohei Kobayashi, Tomoya Yokono, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, M. Umemura","doi":"10.1109/ICFPT52863.2021.9609944","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609944","url":null,"abstract":"This paper proposes the efficient buffering approach for implementing radiative transfer equations to bridge the performance gap between processing elements and HBM memory bandwidth. The radiation transfer equation originally focuses on the fundamental physics process in astrophysics. Besides, it has become the focus of a lot of attention in recent years because of the wealth of applications such as medical bioimaging. However, the acceleration requires a complicated memory access pattern with low latency, and the earlier studies unveil conventional memory access based on software control has no aptitude for this computation. Thus, this article introduced an HBM FPGA and proposed an application-specific buffering mechanism called PRISM (PRefetchable and Instantly accessible Scratchpad Memory) to efficiently bridge the computational unit and the HBM. The proposed approach was evaluated on a XILINX Alveo U280 FPGA, and the experimental results are also discussed.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124449416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Performance Hardware Implementation of CRYSTALS-Dilithium
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609917
Luke Beckwith, D. Nguyen, K. Gaj
Many currently deployed public-key cryptosystems are based on the difficulty of the discrete logarithm and integer factorization problems. However, given an adequately sized quantum computer, these problems can be solved in polynomial time as a function of the key size. Due to this future threat to current cryptographic standards, alternative algorithms that remain secure against quantum computers are being evaluated for future use. One such algorithm is CRYSTALS-Dilithium, a lattice-based digital signature scheme and a finalist in the NIST Post-Quantum Cryptography (PQC) competition. As part of this evaluation, high-performance implementations of these algorithms must be investigated. This work presents a high-performance implementation of CRYSTALS-Dilithium targeting FPGAs. In particular, we present a design that achieves the best latency for an FPGA implementation to date. We also compare our results with the most relevant previous work on hardware implementations of NIST Round 3 post-quantum digital signature candidates.
{"title":"High-Performance Hardware Implementation of CRYSTALS-Dilithium","authors":"Luke Beckwith, D. Nguyen, K. Gaj","doi":"10.1109/ICFPT52863.2021.9609917","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609917","url":null,"abstract":"Many currently deployed public-key cryptosystems are based on the difficulty of the discrete logarithm and integer factorization problems. However, given an adequately sized quantum computer, these problems can be solved in polynomial time as a function of the key size. Due to the future threat of quantum computing to current cryptographic standards, alternative algorithms that remain secure under quantum computing are being evaluated for future use. One such algorithm is CRYSTALS-Dilithium, a lattice-based digital signature scheme, which is a finalist in the NIST Post Quantum Cryptography (PQC) competition. As a part of this evaluation, high-performance implementations of these algorithms must be investigated. This work presents a high-performance implementation of CRYSTALS-Dilithium targeting FPGAs. In particular, we present a design that achieves the best latency for an FPGA implementation to date. We also compare our results with the most-relevant previous work on hardware implementations of NIST Round 3 post-quantum digital signature candidates.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122657831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamSVD: Low-rank Approximation and Streaming Accelerator Co-design
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609813
Zhewen Yu, C. Bouganis
The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when access to training data is not possible. Low-rank approximation is one of the methods often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN onto an FPGA device and presents StreamSVD, a model-accelerator co-design framework. The framework simultaneously considers the compression of a CNN model through a hardware-aware low-rank approximation scheme and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing a better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them in certain cases.
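The abstract does not spell out the exact decomposition StreamSVD applies, so the sketch below only shows the generic post-training low-rank step such schemes build on: a weight matrix (for example, a flattened convolutional kernel) is factorised with an SVD, truncated to rank r, and replaced by two smaller matrices, trading approximation error for fewer parameters and multiply-accumulates. The layer shape, rank, and function name are illustrative.

```python
import numpy as np

def low_rank_factorise(W, rank):
    """Truncated-SVD factorisation W ~= A @ B with A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)   # hypothetical layer
A, B = low_rank_factorise(W, rank=64)

orig_params = W.size                     # 256 * 512 = 131072
lowrank_params = A.size + B.size         # 256*64 + 64*512 = 49152
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(orig_params, lowrank_params, round(rel_err, 3))
```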
{"title":"StreamSVD: Low-rank Approximation and Streaming Accelerator Co-design","authors":"Zhewen Yu, C. Bouganis","doi":"10.1109/ICFPT52863.2021.9609813","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609813","url":null,"abstract":"The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when the access to training data is not possible. Low-rank approximation is one of the methods that is often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN into an FPGA device, and presents StreamSVD, a model-accelerator co-design framework1. The framework considers simultaneously the compression of a CNN model through a hardware-aware low-rank approximation scheme, and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them under certain cases.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122001560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
General routing architecture modelling and exploration for modern FPGAs
Pub Date: 2021-12-06 | DOI: 10.1109/ICFPT52863.2021.9609935
Jiadong Qian, Yuhang Shen, Kaichuang Shi, Hao Zhou, Lingli Wang
Routing architecture has a significant impact on the area, critical path delay, and power consumption of modern FPGAs. The most common routing architecture for island-style FPGAs in academia is the CB-SB model, which cannot effectively model the complex routing architectures of modern FPGAs. To improve the routability and performance of the existing routing model, we propose a new routing model called the General Routing Block (GRB) to model complex commercial FPGAs. In the proposed model, all routing resources are divided into three modules: the general switch block (GSB), the input connection block (ICB), and the output connection block (OCB). The GSB and ICB extend the SB and CB with more flexible and richer connections. The OCB is a new module that provides novel connections for the LB output pins. We support a bent-wire architecture to reduce delay, and two-level MUXes with output sharing to achieve a better trade-off between area and flexibility. Moreover, to explore the trade-offs of different design spaces and find better architectures, an architecture exploration platform based on simulated annealing is proposed to efficiently explore the enormous design space specified by a set of parameters. The results of global design space exploration show that the architecture with the proposed GRB model reduces the critical path delay by 15.5% and the area-delay product by 14.8% compared to the length-4 CB-SB architecture on the VTR benchmarks. After further local subspace explorations, the best architecture achieves an 18.7% improvement in critical path delay and a 23.8% improvement in area-delay product, a significant improvement over other routing architectures.
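The exploration platform is based on simulated annealing over architecture parameters; the sketch below shows only the generic SA loop such a platform could use, with a stand-in cost function in place of a real flow that would synthesise the candidate architecture and return its area-delay product. The parameter names and evaluate() are placeholders, not the paper's tool interface.

```python
import math, random

def evaluate(params):
    """Placeholder for a real flow that implements the GRB architecture
    described by `params` and returns its area-delay product."""
    return (params["wire_length"] - 4) ** 2 + 0.1 * params["icb_fanin"]

def anneal(initial, neighbours, steps=1000, t0=10.0, alpha=0.995):
    current, best = dict(initial), dict(initial)
    cur_cost = best_cost = evaluate(current)
    t = t0
    for _ in range(steps):
        cand = neighbours(current)                  # perturb one parameter
        cost = evaluate(cand)
        # Accept improvements always, worse moves with Boltzmann probability.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = dict(cand), cost
        t *= alpha                                  # cool down
    return best, best_cost

def neighbours(p):
    q = dict(p)
    key = random.choice(list(q))
    q[key] = max(1, q[key] + random.choice([-1, 1]))
    return q

print(anneal({"wire_length": 8, "icb_fanin": 12}, neighbours))
```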
{"title":"General routing architecture modelling and exploration for modern FPGAs","authors":"Jiadong Qian, Yuhang Shen, Kaichuang Shi, Hao Zhou, Lingli Wang","doi":"10.1109/ICFPT52863.2021.9609935","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609935","url":null,"abstract":"Routing architecture has a significant impact on the area, critical path delay and power consumption of modern FPGAs. The most common routing architecture of island-style FPGAs in academia is the CB-SB model, which is not effective to model complex routing architectures in modern FPGAs. To improve the routability and performance of the existing routing model, we propose a new routing model called General Routing Block (GRB) to model complex commercial FPGAs. In the proposed model, all routing resources can be divided into three modules: general switch block (GSB), input connection block (ICB) and output connection block (OCB). The GSB and ICB are extended from the SB and CB with more flexible and richer connections. The OCB is a new module that provides novel connections for the LB output pins. We support bent wire architecture to reduce the delay, and two-level MUXes with output sharing to achieve a better trade-off between the area and flexibility. Moreover, to explore the trade-offs of different design spaces and find better architectures, an architecture exploration platform based on the simulated annealing algorithm is proposed to efficiently explore the enormous design space specified by a set of parameters. The results of global design space exploration show that the architecture with the proposed GRB model reduces the critical path delay by 15.5% and area-delay product by 14.8% compared to the length-4 CB-SB architecture based on the VTR benchmarks. After further local subspace explorations, the best architecture can achieve an 18.7% improvement on the critical path delay and a 23.8% improvement on the area-delay product, which represents a significant improvement over other routing architectures.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"306 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116606395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}