High-Efficiency Compressor Trees for Latest AMD FPGAs
Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer
High-fan-in dot product computations are ubiquitous in highly relevant application domains such as signal processing and machine learning. In particular, the diverse set of data formats used in machine learning poses a challenge for flexible, efficient design solutions. Ideally, a dot product summation is composed of a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGAs, these compressor trees are constructed from generalized parallel counters (GPCs) whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal™ fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations such as logic gate absorption. In comparison to the Vivado™ default implementation, combining such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation, at a slight cost in critical path delay that still allows operation well above 500 MHz. We demonstrate the aptness of our solution on examples of low-precision integer dot product accumulation units.
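As a rough illustration of the carry-save principle behind such compressor trees, the following Python sketch reduces a bit matrix with plain 3:2 counters (full adders) and finishes with a single carry-propagate addition. It is only a behavioural model under simplifying assumptions; the paper's generator uses fabric-specific GPCs, clustering heuristics, and gate absorption that are not captured here.

```python
# Behavioural sketch of a compressor tree: a bit matrix (one list of bits per
# column weight) is reduced with 3:2 counters until no column holds more than
# two bits; a terminal carry-propagate addition then produces the result.
# Illustration only -- not the paper's GPC-based generator.

def compress(columns):
    """One compression stage: apply 3:2 counters (full adders) column by column."""
    out = [[] for _ in range(len(columns) + 1)]
    for w, col in enumerate(columns):
        while len(col) >= 3:
            a, b, c = col.pop(), col.pop(), col.pop()
            s = a ^ b ^ c                      # sum bit, weight w
            cy = (a & b) | (a & c) | (b & c)   # carry bit, weight w + 1
            out[w].append(s)
            out[w + 1].append(cy)
        out[w].extend(col)                     # at most two leftover bits
    while out and not out[-1]:
        out.pop()                              # drop an unused trailing column
    return out

def compressor_tree_sum(columns):
    while any(len(col) > 2 for col in columns):
        columns = compress(columns)            # carry-free reduction stages
    # Terminal carry-propagate addition of the (at most) two remaining rows.
    return sum(sum(col) << w for w, col in enumerate(columns))

# Example: sum eight 4-bit operands via their bit matrix.
operands = [3, 7, 11, 2, 14, 9, 5, 12]
columns = [[(x >> w) & 1 for x in operands] for w in range(4)]
assert compressor_tree_sum(columns) == sum(operands)
```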
{"title":"High-Efficiency Compressor Trees for Latest AMD FPGAs","authors":"Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer","doi":"10.1145/3645097","DOIUrl":"https://doi.org/10.1145/3645097","url":null,"abstract":"<p>High-fan-in dot product computations are ubiquitous in highly relevant application domains, such as signal processing and machine learning. Particularly, the diverse set of data formats used in machine learning poses a challenge for flexible efficient design solutions. Ideally, a dot product summation is composed from a carry-free compressor tree followed by a terminal carry-propagate addition. On FPGA, these compressor trees are constructed from generalized parallel counters (GPCs) whose architecture is closely tied to the underlying reconfigurable fabric. This work reviews known counter designs and proposes new ones in the context of the new AMD Versal™ fabric. On this basis, we develop a compressor generator featuring variable-sized counters, novel counter composition heuristics, explicit clustering strategies, and case-specific optimizations like logic gate absorption. In comparison to the Vivado™ default implementation, the combination of such a compressor with a novel, highly efficient quaternary adder reduces the LUT footprint across different bit matrix input shapes by 45% for a plain summation and by 46% for a terminal accumulation at a slight cost in critical path delay still allowing an operation well above 500 MHz. We demonstrate the aptness of our solution at examples of low-precision integer dot product accumulation units.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"34 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139772659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration
Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang
The Field-Programmable Gate Array (FPGA) is a versatile and programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, the computing energy efficiency of FPGAs is low because energy consumption is dominated by interconnect data movement. In this paper, we propose an all-digital Compute-In-Memory (CIM) FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit for the digital CIM core that accelerates vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (NCIMD) is also developed to support automatic deployment and mapping of DNN networks; NCIMD provides a user-friendly API for DNN models in Caffe format. We further introduce a Weight-Stationary (WS) dataflow and describe how a single network layer is mapped to the CIM array in the architecture. We conduct experiments on the proposed FPGA architecture in Deep Learning (DL) as well as non-DL fields, using different architectural layouts and mapping strategies, and compare the results with a conventional FPGA architecture. The experimental results show that, compared to the conventional FPGA architecture, our CIM FPGA architecture improves energy efficiency by up to 16.1× and reduces latency by up to 40%.
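The sketch below models the bit-serial VMM arithmetic that such a digital CIM core performs: input bit-planes are streamed one per cycle against a stationary weight matrix, and the partial results are shift-accumulated. It is a behavioural illustration only; the array organization, dataflow, and precision choices of the proposed architecture are not modelled here.

```python
# Software model of bit-serial vector-matrix multiplication (VMM).
# Inputs are streamed one bit per "cycle" (LSB first); each cycle performs a
# binary VMM against the stationary weight matrix, and the results are
# shift-accumulated. Illustration only, not the paper's circuit.

import numpy as np

def bit_serial_vmm(x, W, n_bits=8):
    """Compute x @ W by streaming the bits of the unsigned input vector x."""
    acc = np.zeros(W.shape[1], dtype=np.int64)
    for b in range(n_bits):                    # one cycle per input bit
        x_bit = (x >> b) & 1                   # bit-plane of the input vector
        acc += (x_bit @ W) << b                # binary VMM, then shift-accumulate
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=16)              # 8-bit unsigned activations
W = rng.integers(-8, 8, size=(16, 4))          # small signed weights (stationary)
assert np.array_equal(bit_serial_vmm(x, W), x @ W)
```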
{"title":"An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration","authors":"Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang","doi":"10.1145/3640469","DOIUrl":"https://doi.org/10.1145/3640469","url":null,"abstract":"<p>Field Programmable Gate Array (FPGA) is a versatile and programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, FPGA’s computing energy efficiency is low due to the domination of energy consumption by interconnect data movement. In this paper, we propose an all-digital Compute-In-Memory FPGA architecture for deep learning acceleration. Furthermore, we present a bit-serial computing circuit of the Digital CIM core for accelerating vector-matrix multiplication (VMM) operations. A Network-CIM-Deployer (<i>NCIMD</i>) is also developed to support automatic deployment and mapping of DNN networks. <i>NCIMD</i> provides a user-friendly API of DNN models in Caffe format. Meanwhile, we introduce a Weight-Stationary (WS) dataflow and describe the method of mapping a single layer of the network to the CIM array in the architecture. We conduct experimental tests on the proposed FPGA architecture in the field of Deep Learning (DL), as well as in non-DL fields, using different architectural layouts and mapping strategies. We also compare the results with the conventional FPGA architecture. The experimental results show that compared to the conventional FPGA architecture, the energy efficiency can achieve a maximum speedup of 16.1 ×, while the latency can decrease up to (40% ) in our proposed CIM FPGA architecture.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"14 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139469745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm
Sajjad Rostami Sani, Andy Ye
This work introduces a new area model for estimating the layout area of switch blocks. The model is based on a realistic layout strategy. As a result, it takes into consideration not only the active area needed to construct a switch block but also the number of available metal layers and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way that reduces the number of vias needed to connect different routing tracks while maintaining the tile-based structure of FPGAs. It also accounts for the wiring area required for buffer insertion on long wire segments. The model is evaluated against layouts constructed in the ASAP7 FinFET 7nm Predictive Design Kit. We find that the new model, while specific to the layout strategy it employs, improves upon traditional active-area-based estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model more accurately estimates layout area by predicting when metal area overtakes active area as the number of routing tracks increases. This ability allows a more accurate estimation of the true layout cost of FPGA fabrics at the early floor-planning and architectural-exploration stage, and this increase in accuracy can encourage wider use of custom FPGA fabrics that target specific sets of benchmarks in future SoC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain correct under FinFET geometries and wiring-area considerations despite their exclusive use of active-only area models. This is due to the small channel widths, around 30-60 tracks per channel, of the architectures these studies investigate. For architectures that approach the channel width of modern commercial FPGAs, with over one to two hundred tracks per channel, our data show that wiring-area models justified by detailed layout considerations are an essential addition to active-area models for correctly predicting the implementation area of FPGAs.
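A toy calculation can illustrate the crossover such a model captures: if the active area of a switch block grows roughly linearly with channel width W (more switches), while the wiring area over the tile grows roughly quadratically (the tile edge must accommodate W tracks at a fixed metal pitch spread over the available layers), then wiring eventually dominates. All constants and scaling assumptions in the sketch below are placeholders chosen for illustration in arbitrary units; they are not taken from the paper's model.

```python
# Toy crossover illustration (placeholder constants, arbitrary units).
# Not the paper's area model.

def active_area(w_tracks, area_per_track=6.0):
    """Active (transistor) area, assumed ~linear in channel width."""
    return area_per_track * w_tracks

def wiring_area(w_tracks, metal_pitch=1.0, layers=4):
    """Wiring area over the tile: tracks spread over the layers set the tile edge."""
    tracks_per_layer = -(-w_tracks // layers)   # ceiling division
    side = tracks_per_layer * metal_pitch
    return side * side

for w in (30, 60, 120, 200):
    a, m = active_area(w), wiring_area(w)
    dominant = "wiring" if m > a else "active"
    print(f"W={w:3d}  active={a:7.1f}  wiring={m:7.1f}  dominant: {dominant}")
```

With these placeholder values, active area dominates at the 30-60 track channel widths of the prior studies, while wiring area dominates well before the 100-200 track widths of modern commercial fabrics.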
{"title":"Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm","authors":"Sajjad Rostami Sani, Andy Ye","doi":"10.1145/3639055","DOIUrl":"https://doi.org/10.1145/3639055","url":null,"abstract":"<p>A new area model for estimating the layout area of switch blocks is introduced in this work. The model is based on a realistic layout strategy. As a result, it not only takes into consideration the active area that is needed to construct a switch block but also the number of metal layers available and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way that reduces the number of vias that are needed to connect different routing tracks together while maintaining the tile-based structure of FPGAs. It also takes into account the wiring area required for buffer insertion for long wire segments. The model is evaluated based on the layouts constructed in ASAP7 FinFET 7nm Predictive Design Kit. We found that the new model, while specific to the layout strategy that it employs, improves upon the traditional active-based area estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model is able to more accurately estimate layout area by predicting when metal area will overtake active area as the number of routing tracks is increased. This ability allows the more accurate estimation of the true layout cost of FPGA fabrics at the early floor planning and architectural exploration stage; and this increase in accuracy can encourage a wider use of custom FPGA fabrics that target specific sets of benchmarks in future SOC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain to be correct under FinFET geometries and wiring area considerations despite their exclusive use of active-only area models. This correctness is due to the small channel widths, around 30-60 tracks per channel, of the architectures that these studies investigate. For architectures that approach the channel width of modern commercial FPGAs with over one to two hundreds tracks per channel, our data show that wiring area models justified by detailed layout considerations are an essential addition to active area models in the correct prediction of the implementation area of FPGAs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"4 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139082978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSAIL2019 Crypto-Puzzle Solver Architecture
Sergey Gribok, Bogdan Pasca, Martin Langhammer
The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these puzzles requires large amounts of intrinsically sequential computation, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication. The complexity of each iteration is several times greater than that of known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified. In this article, we present several FPGA architectures for a CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method that is flexible and can fit a wide variety of current FPGA sizes. We introduce a class of multi-cycle squarer-based architectures that allow better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 22 of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.
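The sequential core of such Rivest-style time-lock puzzles is repeated modular squaring, computing 2^(2^t) mod n, where each squaring depends on the previous one and therefore cannot be parallelized. The sketch below shows this structure with a small toy modulus and illustrative milestone checkpoints; the real CSAIL2019 puzzle uses a 3072-bit modulus whose factorization is unknown and vastly more iterations, and its milestone definitions differ from this example.

```python
# Toy model of the time-lock puzzle's sequential squaring chain.
# Illustration only: tiny modulus, few iterations, made-up milestone spacing.

def timelock_solve(n, t, base=2, milestone_every=None):
    """Return base^(2^t) mod n via t sequential modular squarings."""
    w = base % n
    milestones = []
    for i in range(1, t + 1):
        w = (w * w) % n                           # one intrinsically sequential step
        if milestone_every and i % milestone_every == 0:
            milestones.append((i, w))             # intermediate "milestone" value
    return w, milestones

# Toy parameters: a small composite modulus and few iterations, so the result
# can be cross-checked directly with Python's built-in pow().
n = 1_000_003 * 1_000_033
t = 10_000
w, ms = timelock_solve(n, t, milestone_every=2_500)
assert w == pow(2, 2 ** t, n)    # direct check is only feasible because t is tiny
print(f"2^(2^{t}) mod n = {w}; milestones at iterations {[i for i, _ in ms]}")
```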
{"title":"CSAIL2019 Crypto-Puzzle Solver Architecture","authors":"Sergey Gribok, Bogdan Pasca, Martin Langhammer","doi":"10.1145/3639056","DOIUrl":"https://doi.org/10.1145/3639056","url":null,"abstract":"<p>The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computations, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone versions of the puzzle have been specified. In this article, we present several FPGA architectures for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We introduce a class of multi-cycle squarer-based architectures that allow for better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 22 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139069353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FPGA 2022","authors":"P. Ienne","doi":"10.1145/3618114","DOIUrl":"https://doi.org/10.1145/3618114","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"15 6","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139004736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC
Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto
Lightweight PQC-related research and development have recently gained attention from the research community. The Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC) is a promising lightweight PQC scheme that uses small parameter sets to fit related applications, but these parameters do not favor popular fast algorithms such as the number theoretic transform. To address this problem, we present in this paper a novel hardware acceleration of RBLWE-ENC based on the Karatsuba algorithm (KA), targeting the field-programmable gate array (FPGA) platform. In detail, we propose an area-efficient Karatsuba Accelerator (AEKA) for RBLWE-ENC, built on three layers of innovation. First, we reformulate the signal processing sequence within the major arithmetic component of the KA-based polynomial multiplication for RBLWE-ENC to obtain a new algorithm. Then, we design the proposed algorithm into a new hardware accelerator using several novel algorithm-to-architecture mapping techniques. Finally, we conduct a thorough complexity analysis and comparison to demonstrate the efficiency of the proposed accelerator: for example, it achieves 62.5% higher throughput and 60.2% lower area-delay product (ADP) than the state-of-the-art design for n = 512 (Virtex-7 device, similar setup). The proposed AEKA design strategy is highly efficient on FPGA devices, i.e., small resource usage with superior timing, and can be integrated with other systems for lightweight-oriented high-performance applications (e.g., servers). The outcome of this work is also expected to impact the advancement of lightweight PQC.
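For readers unfamiliar with the Karatsuba algorithm that AEKA builds on, the generic sketch below shows one level of the split for polynomial multiplication: a full product is assembled from three half-size products instead of four. The coefficient types, recursion depth, and test values are illustrative; the paper's contribution is the reformulated signal-processing sequence and its mapping to FPGA hardware for RBLWE-ENC's binary polynomials, which this sketch does not model.

```python
# Generic Karatsuba polynomial multiplication (coefficient lists, lowest degree
# first). Illustration of the algorithmic idea only, not the AEKA datapath.

def poly_mul_schoolbook(a, b):
    res = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

def poly_add(a, b):
    return [x + y for x, y in zip(a, b)]

def poly_mul_karatsuba(a, b):
    n = len(a)                       # assumes len(a) == len(b), n a power of two
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    p0 = poly_mul_karatsuba(a0, b0)                              # low  x low
    p2 = poly_mul_karatsuba(a1, b1)                              # high x high
    pm = poly_mul_karatsuba(poly_add(a0, a1), poly_add(b0, b1))  # (a0+a1)(b0+b1)
    res = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        res[i] += p0[i]
        res[i + h] += pm[i] - p0[i] - p2[i]                      # middle term
        res[i + 2 * h] += p2[i]
    return res

import random
random.seed(1)
deg = 8
a = [random.randrange(2) for _ in range(deg)]     # binary polynomial (RBLWE-style)
b = [random.randrange(256) for _ in range(deg)]   # general coefficients
assert poly_mul_karatsuba(a, b) == poly_mul_schoolbook(a, b)
```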
{"title":"AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC","authors":"Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto","doi":"10.1145/3637215","DOIUrl":"https://doi.org/10.1145/3637215","url":null,"abstract":"<p>Lightweight PQC-related research and development have gradually gained attention from the research community recently. Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC), a promising lightweight PQC based on small parameter sets to fit related applications (but not in favor of deploying popular fast algorithms like number theoretic transform). To solve this problem, in this paper, we present a novel implementation of hardware acceleration for RBLWE-ENC based on Karatsuba algorithm, particularly on the field-programmable gate array (FPGA) platform. In detail, we have proposed an area-efficient Karatsuba Accelerator (AEKA) for RBLWE-ENC, based on three layers of innovative efforts. First of all, we reformulate the signal processing sequence within the major arithmetic component of the KA-based polynomial multiplication for RBLWE-ENC to obtain a new algorithm. Then, we have designed the proposed algorithm into a new hardware accelerator with several novel algorithm-to-architecture mapping techniques. Finally, we have conducted thorough complexity analysis and comparison to demonstrate the efficiency of the proposed accelerator, e.g., it involves 62.5% higher throughput and 60.2% less area-delay product (ADP) than the state-of-the-art design for <i>n</i> = 512 (Virtex-7 device, similar setup). The proposed AEKA design strategy is highly efficient on the FPGA devices, i.e., small resource usage with superior timing, which can be integrated with other necessary systems for lightweight-oriented high-performance applications (e.g., servers). The outcome of this work is also expected to generate impacts for lightweight PQC advancement.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"12 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138566009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s
Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich
Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need the specific trace segments containing cryptographic operations (COs). For detecting these segments, waveform-matching techniques have been established that compare the signal against a template of the CO's characteristic pattern. Real-time waveform matching requires both the high parallelism achievable with hardware designs and the reconfigurability provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only handle low sampling rates due to the limited clock speed of FPGAs.
In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO under the restrictions imposed by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x over the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique to attack the widespread XTS-AES algorithm, using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.
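For context, the sketch below is a minimal offline model of template-based waveform matching: a template of the CO's characteristic pattern is slid across the digitized trace, and positions where a similarity metric exceeds a threshold are flagged. The metric (normalized cross-correlation), the threshold, and the synthetic trace are illustrative choices; the paper's contribution is a massively parallel streaming realization of such matching at 10 GS/s, which is not modelled here.

```python
# Offline model of template matching on a digitized trace.
# Illustration only -- not the paper's streaming hardware architecture.

import numpy as np

def match_positions(trace, template, threshold=0.7):
    """Return indices where the normalized cross-correlation exceeds threshold."""
    t = (template - template.mean()) / (template.std() + 1e-12)
    windows = np.lib.stride_tricks.sliding_window_view(trace, len(template))
    w_mean = windows.mean(axis=1, keepdims=True)
    w_std = windows.std(axis=1, keepdims=True) + 1e-12
    ncc = ((windows - w_mean) / w_std * t).mean(axis=1)   # Pearson correlation
    return np.flatnonzero(ncc > threshold)

# Synthetic example: bury two noisy copies of a pattern in a longer trace.
rng = np.random.default_rng(0)
template = np.sin(np.linspace(0, 4 * np.pi, 64))
trace = rng.normal(0, 0.3, 2048)
for pos in (300, 1500):
    trace[pos:pos + 64] += template
print(match_positions(trace, template))   # indices clustered around 300 and 1500
```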
{"title":"Design, Calibration, and Evaluation of Real-Time Waveform Matching on an FPGA-based Digitizer at 10 GS/s","authors":"Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich","doi":"10.1145/3635719","DOIUrl":"https://doi.org/10.1145/3635719","url":null,"abstract":"<p>Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need those specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established comparing the signal with a template of the CO’s characteristic pattern. Real-time waveform matching requires highly parallel implementations as achieved by hardware design but also reconfigurability as provided by FPGAs to adapt the matching hardware to a specific CO pattern. However, currently proposed designs process the samples from analog-to-digital converters sequentially and can only process low sampling rates due to the limited clock speed of FPGAs. </p><p>In this paper, we present a parallel waveform-matching architecture capable of performing high-speed waveform matching on a high-end FPGA-based digitizer. We also present a workflow for calibrating the waveform-matching system to the specific pattern of the CO in the presence of hardware restrictions provided by the FPGA hardware. Our implementation enables waveform matching at 10 GS/s, offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us. We demonstrate how to apply the technique for attacking the widespread XTS-AES algorithm using waveform matching to recover the encrypted tweak even in the presence of so-called systemic noise.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"5123 1 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on FCCM 2022","authors":"Jing Li, Martin Herbordt","doi":"10.1145/3632092","DOIUrl":"https://doi.org/10.1145/3632092","url":null,"abstract":"<p>No abstract available.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"29 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hardware design framework for computer vision models based on reconfigurable devices
Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng
In computer vision, algorithms and the computing platforms that run them develop jointly and cannot be treated separately: models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches, customized design and overlay design, but existing work is unable to achieve both efficient performance and the scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices that provides unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting a hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work's throughput is improved by 18.8x-33.6x and 1.4x-2.0x compared to a CPU and a GPU, respectively. Compared to other accelerators on the same platform, accelerators based on our framework better balance resource consumption and efficiency.
{"title":"A hardware design framework for computer vision models based on reconfigurable devices","authors":"Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng","doi":"10.1145/3635157","DOIUrl":"https://doi.org/10.1145/3635157","url":null,"abstract":"<p>In computer vision, the joint development of the algorithm and computing dimensions cannot be separated. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized and overlay design. However, existing work is unable to achieve both efficient performance and scalability to adapt to a wide range of models. To address both considerations, we propose a design framework based on reconfigurable devices to provide unified support for computer vision models. It provides software-programmable modules while leaving unit design space for problem-specific algorithms. Based on the proposed framework, we design a model mapping method and a hardware architecture with two processor arrays to enable dynamic and static reconfiguration, thereby relieving redesign pressure. In addition, resource consumption and efficiency can be balanced by adjusting the hyperparameter. In experiments on CNN, vision Transformer, and vision MLP models, our work’s throughput is improved by 18.8x–33.6x and 1.4x–2.0x compared to CPU and GPU. Compared to others on the same platform, accelerators based on our framework can better balance resource consumption and efficiency.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"192 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design
Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo
Object detection and classification are key tasks in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and the associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
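For reference, the sketch below shows conventional batch NMS, the baseline behaviour that creates the latency problem described above: it must collect all boxes and scores, sort by score, and then suppress overlaps, which is exactly the sequential dependency the proposed pipelined NMS removes. This is not the paper's algorithm, only the standard variant it improves upon, with illustrative boxes and threshold.

```python
# Conventional (batch) greedy NMS -- the baseline with the wait-for-all-boxes
# dependency. Illustration only, not the paper's pipelined algorithm.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def batch_nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:                              # needs *all* boxes before it can start
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(batch_nms(boxes, scores))   # -> [0, 2]; box 1 is suppressed by box 0
```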
{"title":"High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design","authors":"Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-sun Seo","doi":"10.1145/3634919","DOIUrl":"https://doi.org/10.1145/3634919","url":null,"abstract":"<p>Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3 × higher throughput and 5 × lower latency compared to the best prior FPGA-based solution with comparable accuracy.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"8 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}