ACM Transactions on Reconfigurable Technology and Systems: Latest Articles, Page 6

An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning

Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-13 DOI: 10.1145/3615661

Zhengyan Liu, Qiang Liu, Shun Yan, Ray C.C. Cheung

Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, the computation- and resource-intensive nature of CNNs hinders their deployment on embedded systems. This paper proposes an efficient FPGA inference accelerator for CNNs with depthwise separable convolutions (DSCs). To improve accelerator efficiency, we make four contributions: (1) an efficient convolution engine, with multiple strategies for exploiting parallelism and a configurable adder tree, designed to support three types of convolution operations; (2) a dedicated architecture combined with input buffers, designed for the bottleneck network structure to reduce data transmission time; (3) a hardware padding scheme that eliminates invalid padding operations; and (4) a hardware-assisted pruning method that supports an online trade-off between model accuracy and power consumption. Experimental results show that, for MobileNetV2, the accelerator achieves 10x and 6x energy efficiency improvements over the CPU and GPU implementations, respectively, and delivers 302.3 FPS and 181.8 GOPS, the best performance among several existing single-engine FPGA accelerators. The proposed hardware-assisted pruning method reduces power consumption by 59.7% with an accuracy loss within 5%.
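
As a rough illustration of why depthwise separable convolutions are attractive on embedded targets, the sketch below counts multiply-accumulate operations (MACs) for a standard convolution and for its depthwise-plus-pointwise replacement. The layer shape and the function names are illustrative assumptions, not values or code taken from the paper.

```python
def standard_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a standard k x k convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    h = w = 56
    c_in, c_out, k = 64, 128, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    dsc = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard:  {std:,} MACs")
    print(f"separable: {dsc:,} MACs")
    print(f"reduction: {std / dsc:.1f}x")
```

The reduction factor works out to 1 / (1/c_out + 1/k^2), so for 3 × 3 kernels it approaches 9x as the number of output channels grows.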

Citations: 0

XVDPU: A High Performance CNN Accelerator on Versal Platform Powered by AI Engine

Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-13 DOI: 10.1145/3617836

Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, Dongdong Xu, Zhuohuan Liu, Mengke Liu, Xiaoyang Yan, Hong Wang, Rongzhang Zheng, Li Wang, Dong Li, Satyaprakash Pareek, Jian Weng, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan

Convolutional neural networks (CNNs) are now widely used in computer vision applications. However, the trend toward higher accuracy and higher resolution produces larger networks, making computation or I/O requirements the key bottlenecks. In this paper, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips, to meet heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O requirements. We further propose an Arithmetic Logic Unit (ALU) that better balances resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8× faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4050 FPS. We propose a tiling strategy to achieve feature-map-stationary (FMS) operation for high-definition CNNs (HD-CNNs) with the accelerator, achieving a 3.8× FPS improvement on Residual Channel Attention Network (RCAN) and 3.1× on Super-Efficient Super-Resolution (SESR). The accelerator can also handle the 3D convolution task in disparity estimation, achieving end-to-end (E2E) performance of 10.1 FPS with all the optimizations.
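
The throughput figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below recomputes the reported speedup and estimates how well throughput scales from 96 to 256 AIE cores. It assumes that the 4050 FPS figure refers to the same model and board as the 1653 FPS figure, which the abstract does not state explicitly; the function names are illustrative.

```python
def speedup(new_fps: float, base_fps: float) -> float:
    """Ratio of two throughput figures."""
    return new_fps / base_fps


def scaling_efficiency(fps_small: float, cores_small: int,
                       fps_large: float, cores_large: int) -> float:
    """Throughput gain divided by core-count gain (1.0 means linear scaling)."""
    return (fps_large / fps_small) / (cores_large / cores_small)


if __name__ == "__main__":
    # Figures quoted in the abstract.
    print(f"VCK190 (96 cores) vs ZCU102: {speedup(1653, 168.5):.1f}x")
    # Assumption: both FPS figures are ResNet50 on the same board.
    print(f"96 -> 256 cores scaling efficiency: "
          f"{scaling_efficiency(1653, 96, 4050, 256):.0%}")
```

Under that assumption, throughput grows roughly 2.45x for 2.67x more cores, so scaling is close to, but not quite, linear.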

Citations: 0

CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs

Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-13 DOI: 10.1145/3616873

Kenneth Liu, Alec Lu, Kartik Samtani, Zhenman Fang, Licheng Guo

The k-nearest neighbors (KNN) algorithm is essential to many applications, such as similarity search, image classification, and database query. With the rapid growth in dataset size and in the feature dimension of each data point, processing KNN becomes more compute- and memory-hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resources on FPGAs. However, they often overlook memory access optimizations on FPGA platforms and achieve only a marginal speedup over a multi-threaded CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN: an HLS-based, configurable, and high-performance KNN accelerator. CHIP-KNN optimizes off-chip memory access on modern HBM-based FPGAs such as the AMD/Xilinx Alveo U280 FPGA board. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension and data type representation of each data point, the distance metric, and the number of nearest neighbors, K. In terms of design architecture, we explore and discuss the trade-offs between two design versions: CHIP-KNNv1 (ping-pong buffer based) and CHIP-KNNv2 (streaming based). Moreover, we investigate the routing congestion issue in our accelerator design, implement hierarchical structures to shorten critical paths, and integrate an open-source floorplanning optimization tool called TAPA/AutoBridge to eliminate the place-and-route issues. To explore the design space and balance computation and memory access performance, we also build an analytical performance model. Given a user configuration of the KNN parameters, our tool can automatically generate TAPA HLS C code for the optimal accelerator design, along with the corresponding host code, on the HBM-based FPGA platform. Our experimental results on the Alveo U280 show that, compared to a 48-thread CPU implementation, CHIP-KNNv2 achieves a geomean performance speedup of 15x, with a maximum speedup of 45x. Additionally, we show that CHIP-KNNv2 achieves up to 2.1x performance speedup over CHIP-KNNv1 while increasing configurability. Compared with the state-of-the-art Facebook AI Similarity Search (FAISS) [23] GPU implementation running on an NVIDIA Tesla V100 GPU, CHIP-KNNv2 achieves an average latency reduction of 30.6x while requiring 34.3% of the GPU power consumption.
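
For readers unfamiliar with the computation being accelerated, the following is a minimal software sketch of brute-force KNN with the configurable pieces the abstract lists: a pluggable distance metric and the neighbor count K. It processes the dataset as a stream and keeps only the K best candidates in a bounded heap. This is a plain-Python illustration with hypothetical names (`knn_stream`, `l1`, `l2_squared`), not the CHIP-KNN hardware design or its generated HLS code.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def l2_squared(a: Point, b: Point) -> float:
    """Squared Euclidean distance (the square root does not change the ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a: Point, b: Point) -> float:
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_stream(query: Point, dataset: Iterable[Point], k: int,
               metric: Metric = l2_squared) -> List[Tuple[float, int]]:
    """Return the k nearest (distance, index) pairs, nearest first.

    The dataset is consumed in a single pass; only k candidates are kept.
    """
    heap: List[Tuple[float, int]] = []  # max-heap emulated with negated distances
    for idx, point in enumerate(dataset):
        d = metric(query, point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:           # closer than the current worst candidate
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)


if __name__ == "__main__":
    data = [(0, 0), (1, 1), (5, 5), (2, 2)]
    print(knn_stream((0, 0), data, k=2))
    print(knn_stream((0, 0), data, k=3, metric=l1))
```

Every data point is read exactly once and the per-point work is small, which is consistent with the abstract's emphasis on off-chip memory access as the part worth optimizing for large datasets.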

Citations: 0

A Hardware Accelerator for the Semi-Global Matching Stereo Algorithm

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-09 DOI: 10.1145/3615869

J. Kalomiros, J. Vourvoulakis, S. Vologiannidis

The semi-global matching stereo algorithm is a top-performing algorithm in stereo vision. The recursive nature of its computations introduces an inherent data dependency problem that hinders the progressive computation of disparities at pixel clock. In this work, a novel hardware implementation of the semi-global matching algorithm is presented. A hardware structure of parallel comparators is proposed for fast computation of the minima among large cost arrays in one clock cycle. A hardware-friendly algorithm is also proposed for computing the minima among far-indexed disparities, shortening the length of computations in the datapath. As a result, the recursive path cost computation is accelerated considerably. The system is implemented in a Stratix V device and in a Zynq UltraScale+ device. A throughput of 55.1 million disparities per second is achieved with a maximum disparity of 128 pixels and a frame resolution of 1280 × 720. The proposed architecture is less elaborate and more resource-efficient than other systems in the literature, and its performance compares favorably with theirs. An implementation on an actual FPGA board is also presented and serves as a real-world verification of the proposed system.
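
The data dependency mentioned in the abstract comes from the path cost recurrence of semi-global matching. The sketch below implements the textbook form of that recurrence along a single path, L(p, d) = C(p, d) + min(L(p-1, d), L(p-1, d±1) + P1, min_i L(p-1, i) + P2) - min_k L(p-1, k), to make the dependency visible. It is a plain-Python illustration of the standard formulation, not the hardware-friendly variant proposed in the paper; the function name and the toy costs are assumptions.

```python
from typing import List


def aggregate_path(cost: List[List[float]], p1: float, p2: float) -> List[List[float]]:
    """Aggregate matching costs along one path (for example, left to right).

    cost[x][d] is the matching cost of pixel x at disparity d.
    p1 penalizes a disparity change of one level, p2 penalizes larger jumps.
    """
    n_disp = len(cost[0])
    prev = list(cost[0])               # the first pixel has no predecessor
    out = [prev]
    for x in range(1, len(cost)):
        prev_min = min(prev)           # minimum over all disparities of the previous pixel
        cur = []
        for d in range(n_disp):
            best = min(prev[d], prev_min + p2)
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)
            # Subtracting prev_min keeps the aggregated values bounded.
            cur.append(cost[x][d] + best - prev_min)
        out.append(cur)
        prev = cur                     # each pixel depends on the one before it
    return out


if __name__ == "__main__":
    toy_costs = [[0, 5, 5], [5, 0, 5]]   # two pixels, three disparity levels
    print(aggregate_path(toy_costs, p1=1, p2=3))
```

Each pixel needs `min(prev)` over the whole disparity range before any of its own costs can be computed, which is the kind of large-array minimum that the abstract says is evaluated in one clock cycle by parallel comparators.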

Citations: 0

FPGA-based Deep Learning Inference Accelerators: Where Are We Standing?

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-04 DOI: 10.1145/3613963

Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, R. Buchty, Mladen Berekovic

Recently, artificial intelligence applications have become part of almost all emerging technologies around us. Neural networks, in particular, have shown significant advantages and have been widely adopted over other approaches in machine learning. In this context, high processing power is deemed a fundamental challenge and a persistent requirement. Recent solutions to this challenge deploy hardware platforms to provide high computing performance for neural networks and deep learning algorithms. This direction is also rapidly taking over the market. Here, FPGAs occupy the middle ground in flexibility, reconfigurability, and efficiency between general-purpose CPUs and GPUs on one side and manufactured ASICs on the other. FPGA-based accelerators exploit the features of FPGAs to increase the computing performance for specific algorithms and algorithm features. Filling a gap, we provide holistic benchmarking criteria and optimization techniques that work across several classes of deep learning implementations. This paper summarizes the current state of deep learning hardware acceleration: more than 120 FPGA-based neural network accelerator designs are presented and evaluated based on a matrix of performance and acceleration criteria, and corresponding optimization techniques are presented and discussed. In addition, the evaluation criteria and optimization techniques are demonstrated by benchmarking ResNet-2 and LSTM-based accelerators.

Citations: 0

BLOOP: Boolean Satisfiability-based Optimized Loop Pipelining

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-07-27 DOI: 10.1145/3599972

Nicolai Fiege, P. Zipf

Modulo scheduling is the premier technique for throughput maximization of loops in high-level synthesis by interleaving consecutive loop iterations. The number of clock cycles between data insertions is called the initiation interval (II). For throughput maximization, this value should be as low as possible; therefore, its minimization is the main optimization goal. Despite its long history, modulo scheduling has remained a relevant research topic, with many exact and heuristic algorithms available in the literature. Nevertheless, we are able to leverage the scalability of modern Boolean Satisfiability (SAT) solvers to outperform state-of-the-art ILP-based algorithms for latency-optimal modulo scheduling for both integer and rational IIs. Our algorithm is able to compute valid modulo schedules for the whole CHStone and MachSuite benchmark suites, with 99% of the solutions being proven to be throughput optimal for a timeout of only 10 minutes per candidate II. For various time limits, not a single tested scheduler from the state of the art is able to compute more verified optimal solutions or even a single schedule with a higher throughput than our proposed approach. Using an HLS toolflow, we show that our algorithm can be effectively used to generate Pareto-optimal FPGA implementations regarding throughput and resource usage.
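
To make the initiation interval concrete, the sketch below computes the classical lower bound on an integer II that modulo schedulers typically start from: the larger of the resource-constrained bound (ResMII) and the recurrence-constrained bound (RecMII). This is the textbook bound only, not the SAT formulation from the paper, and it does not cover the rational IIs the abstract mentions. The function names, operation mix, and dependence cycle are hypothetical.

```python
from math import ceil
from typing import Dict, Iterable, Tuple


def res_mii(op_counts: Dict[str, int], unit_counts: Dict[str, int]) -> int:
    """Resource bound: each unit can start one operation of its type per cycle."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_mii(cycles: Iterable[Tuple[int, int]]) -> int:
    """Recurrence bound over dependence cycles.

    Each cycle is (total latency, total iteration distance); a loop without
    loop-carried dependences yields the trivial bound of 1.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def min_ii(op_counts: Dict[str, int], unit_counts: Dict[str, int],
           cycles: Iterable[Tuple[int, int]]) -> int:
    """Lower bound on the integer initiation interval."""
    return max(res_mii(op_counts, unit_counts), rec_mii(cycles))


if __name__ == "__main__":
    ops = {"mul": 4, "add": 3}       # operations per loop iteration (hypothetical)
    units = {"mul": 2, "add": 1}     # available functional units (hypothetical)
    recurrences = [(5, 2)]           # latency 5 spread over 2 iterations
    ii = min_ii(ops, units, recurrences)
    print(f"minimum II = {ii}, i.e. one new iteration every {ii} cycles")
```

A scheduler then searches candidate IIs starting at this bound; the per-candidate timeout in the abstract refers to how long that search is allowed to run for each candidate value.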

Citations: 1

Topgun: An ECC Accelerator for Private Set Intersection

IF 2.3, Zone 4 (Computer Science), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-07-13 DOI: 10.1145/3603114

Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang

Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in the Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and the Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High-performance implementations of ECC are required, especially for the DH-PSI protocol used in privacy-preserving platforms.

Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of the DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on a Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element (PE) achieves a 3.14× speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We have also implemented our architecture, consisting of 4 Compute Groups with 16 PEs each, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at a cost of 86 W of power, which is record-setting performance for point multiplication over Curve25519 on FPGAs.
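
For context on what a point multiplication over Curve25519 involves, here is a minimal software sketch of the X25519 function using the Montgomery ladder, following the pseudocode in RFC 7748. It is a reference-style illustration only: it is not constant-time, it is unrelated to Topgun's instruction set or pipeline, and it must not be used for real cryptography. The function and variable names are chosen for this example.

```python
P = 2**255 - 19   # field prime of Curve25519
A24 = 121665      # (486662 - 2) / 4, the ladder constant from RFC 7748


def _decode_scalar(k: bytes) -> int:
    """Clamp a 32-byte scalar as specified in RFC 7748."""
    b = bytearray(k)
    b[0] &= 248
    b[31] &= 127
    b[31] |= 64
    return int.from_bytes(b, "little")


def _decode_u(u: bytes) -> int:
    """Decode a u-coordinate, masking the unused top bit."""
    b = bytearray(u)
    b[31] &= 127
    return int.from_bytes(b, "little")


def x25519(k: bytes, u: bytes) -> bytes:
    """Scalar multiplication on Curve25519 (u-coordinate only)."""
    scalar, x1 = _decode_scalar(k), _decode_u(u)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        bit = (scalar >> t) & 1
        swap ^= bit
        if swap:  # NOTE: a branch is not constant-time; illustration only
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3, z2, z3 = x3, x2, z3, z2
    # Convert from projective form with a modular inverse (Fermat's little theorem).
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


if __name__ == "__main__":
    base = (9).to_bytes(32, "little")                   # standard base point u = 9
    a, b = bytes(range(32)), bytes(range(32, 64))       # toy secret scalars
    shared_ab = x25519(a, x25519(b, base))
    shared_ba = x25519(b, x25519(a, base))
    print(shared_ab == shared_ba)
```

Each of the 255 ladder steps costs a handful of field multiplications and squarings modulo 2^255 - 19, which gives a sense of the finite-field workload that a pipelined arithmetic unit such as the one described in the abstract has to sustain.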

Citations: 0

Topgun: An ECC Accelerator for Private Set Intersection

IF 2.3, Zone 4 (Computer Science), Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-07-13 DOI: 10.1145/3603114

Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang

Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.
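
The abstract does not describe Topgun's instruction set or datapath, so the following is only a software reference point for the operation being accelerated: a minimal Python sketch of point multiplication over Curve25519 using the Montgomery ladder from RFC 7748 (the X25519 function). The function names `clamp` and `x25519` and the integer-based interface are assumptions made for illustration. The closing check shows the commutativity that Diffie-Hellman style protocols such as DH-PSI rely on.

```python
# Illustrative software model of X25519 (RFC 7748); this is NOT the Topgun design.
# It is not constant-time (Python integers and branches leak timing),
# so it must not be used with real secret keys.

P = 2**255 - 19   # field prime of Curve25519
A24 = 121665      # ladder constant (486662 - 2) / 4 from RFC 7748
BASE_U = 9        # u-coordinate of the standard base point


def clamp(k: int) -> int:
    """Clamp a scalar as RFC 7748 prescribes for X25519."""
    k &= (1 << 255) - 8   # clear the three low bits and everything above bit 254
    k |= 1 << 254         # set bit 254
    return k


def x25519(k: int, u: int) -> int:
    """Return the u-coordinate of clamp(k) times the point with u-coordinate u."""
    k = clamp(k)
    x1 = (u & ((1 << 255) - 1)) % P   # mask the top bit, reduce into the field
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        swap ^= bit
        if swap:                      # conditional swap of the two ladder points
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    return x2 * pow(z2, P - 2, P) % P   # affine u = X / Z via Fermat inversion


if __name__ == "__main__":
    a, b = 0xA11CE, 0xB0B               # toy scalars, for demonstration only
    shared_ab = x25519(a, x25519(b, BASE_U))
    shared_ba = x25519(b, x25519(a, BASE_U))
    print(shared_ab == shared_ba)
```

Each ladder step costs a fixed number of field multiplications and squarings modulo 2^255 - 19, plus one final inversion. That is the workload a pipelined finite-field arithmetic unit would be kept busy with, though how Topgun schedules it is not stated in the abstract.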

Citations: 1

NAPOLY: A Non-deterministic Automata Processor OverLaY

IF 2.3, Zone 4 (Computer Science), Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-22 DOI: 10.1145/3593586

Rasha Karakchi, Jason D. Bakos

Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.
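
For readers unfamiliar with the terms "state activation rate" and "reporting rate", the following Python sketch simulates a small NFA with the textbook active-set method. It illustrates the general technique only: the abstract does not describe NAPOLY's hardware mapping, and the function `run_nfa`, the transition-table layout, and the example pattern are all assumptions made for this sketch.

```python
from typing import Dict, List, Set, Tuple

# (state, input symbol) -> set of next states
Transitions = Dict[Tuple[int, str], Set[int]]


def run_nfa(
    transitions: Transitions,
    start: Set[int],
    accepting: Set[int],
    text: str,
) -> List[int]:
    """Return every index i at which an accepting state is active
    after consuming text[i]."""
    reports: List[int] = []
    active: Set[int] = set()
    for i, symbol in enumerate(text):
        # Start states are re-armed on every symbol so that a match
        # may begin at any position in the input stream.
        sources = active | start
        active = set()
        for state in sources:
            active |= transitions.get((state, symbol), set())
        if active & accepting:
            reports.append(i)   # one "report" event
    return reports


if __name__ == "__main__":
    # Hypothetical three-state NFA that matches the substring "ab".
    table: Transitions = {(0, "a"): {1}, (1, "b"): {2}}
    print(run_nfa(table, start={0}, accepting={2}, text="xabab"))
```

In this model, the size of `active` at each step corresponds to the state activation rate, and the length of `reports` relative to the input length corresponds to the reporting rate. These are the quantities the abstract says limit throughput in earlier designs.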

Citations: 0

Fixed-point FPGA Implementation of the FFT Accumulation Method for Real-time Cyclostationary Analysis

IF 2.3, Zone 4 (Computer Science), Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-06-22 DOI: 10.1145/3567429

Carol Jingyi Li, Xiangwei Li, Binglei Lou, Craig T. Jin, David Boland, Philip H. W. Leong

The spectral correlation density (SCD) is an important tool in cyclostationary signal detection and classification. Even using efficient techniques based on the fast Fourier transform (FFT), real-time implementations are challenging because of the high computational complexity. A key dimension for computational optimization lies in minimizing the wordlength employed. In this article, we analyze the relationship between wordlength and signal-to-quantization noise in fixed-point implementations of the SCD function. A canonical SCD estimation algorithm, the FFT accumulation method (FAM) using fixed-point arithmetic, is studied. We derive closed-form expressions for SQNR and compare them at wordlengths ranging from 14 to 26 bits. The differences between the calculated SQNR and bit-exact simulations are less than 1 dB. Furthermore, an HLS-based FPGA design is implemented on a Xilinx Zynq UltraScale+ XCZU28DR-2FFVG1517E RFSoC. Using less than 25% of the logic fabric on the device, it consumes 7.7 W total on-chip power and has a power efficiency of 12.4 GOPS/W, which is an order of magnitude improvement over an Nvidia Tesla K40 graphics processing unit (GPU) implementation. In terms of throughput, it achieves 50 MS/sec, which is a speedup of 1.6 over a recent optimized FPGA implementation.
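
The paper's closed-form SQNR expressions for the FAM are not reproduced in the abstract. As background on why wordlength drives signal-to-quantization-noise ratio, the sketch below measures the SQNR of a single rounding quantizer applied to a full-scale sine and compares it with the textbook approximation of 6.02 b + 1.76 dB for a b-bit word. This is a generic illustration under those assumptions, not the authors' analysis of the full FAM pipeline; all function names are hypothetical.

```python
import math


def quantize(x: float, wordlength: int) -> float:
    """Round x to signed fixed point with `wordlength` bits
    (1 sign bit, wordlength - 1 fractional bits), saturating at the limits."""
    scale = 1 << (wordlength - 1)
    q = round(x * scale)
    q = max(-scale, min(scale - 1, q))
    return q / scale


def measured_sqnr_db(wordlength: int, n: int = 20000) -> float:
    """Empirical SQNR in dB for a full-scale sine after quantization."""
    freq = (math.sqrt(5) - 1) / 2   # cycles per sample; irrational, so phases never repeat
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * freq * i)
        err = quantize(x, wordlength) - x
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)


def rule_of_thumb_sqnr_db(wordlength: int) -> float:
    """Textbook estimate for a full-scale sine and uniform rounding noise."""
    return 6.02 * wordlength + 1.76


if __name__ == "__main__":
    # The wordlength range mirrors the 14 to 26 bits studied in the article.
    for bits in range(14, 27, 4):
        print(bits, round(measured_sqnr_db(bits), 2), round(rule_of_thumb_sqnr_db(bits), 2))
```

Each additional bit adds roughly 6 dB of SQNR for a single quantizer. A multi-stage algorithm such as the FAM accumulates noise from several quantization points, which is why a dedicated analysis like the one summarized above is needed.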

Citations: 0