ACM Transactions on Reconfigurable Technology and Systems: Latest Articles

A Partitioned CAM Architecture with FPGA Acceleration for Binary Descriptor Matching

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-10-05 DOI: 10.1145/3624749

Parastoo Soleimani, David W. Capson, Kin Fun Li

An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture to support approximate content matching for selecting image matches with local binary descriptors. Matches are based on Hamming distances computed for all possible pairs of binary descriptors extracted from two images. We demonstrate an FPGA-based implementation of our CAM-based descriptor matching unit to illustrate the high matching speed of our design. The time complexity of our modified CAM method for binary descriptor matching is O(n). Our method performs binary descriptor matching at a rate of one descriptor per clock cycle at a frequency of 102 MHz. The resource utilization and timing metrics of several experiments are reported to demonstrate the efficacy and scalability of our design.
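The matching criterion described in the abstract can be illustrated in software. The following is a minimal sketch, assuming descriptors are stored as Python integers and that a pair counts as a match when its nearest neighbour lies within a distance threshold; the function names and the `max_distance` parameter are illustrative assumptions, not taken from the paper. Note that this software loop compares pairs one at a time, whereas a CAM compares a query against its stored entries in parallel.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary descriptors differ."""
    return bin(a ^ b).count("1")


def match_descriptors(query, reference, max_distance):
    """Brute-force matching over all descriptor pairs.

    For each query descriptor, find the reference descriptor with the
    smallest Hamming distance and keep the pair if that distance is at
    most max_distance (a hypothetical threshold, chosen for illustration).
    Returns a list of (query_index, reference_index, distance) tuples.
    """
    matches = []
    if not reference:
        return matches
    for qi, q in enumerate(query):
        best_idx, best_dist = min(
            ((ri, hamming(q, r)) for ri, r in enumerate(reference)),
            key=lambda pair: pair[1],
        )
        if best_dist <= max_distance:
            matches.append((qi, best_idx, best_dist))
    return matches
```

For example, `match_descriptors([0b10110010], [0b10110000, 0b11110000], 1)` pairs the query with the first reference descriptor, which differs from it in a single bit.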


Citations: 0

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-29 DOI: 10.1145/3626103

Alexandre Honorat, Mickaël Dardaillon, Hugo Miomandre, Jean-François Nezan

High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This paper proposes an original method to safely approximate such buffer sizes. The first contribution computes an initial overestimation of buffer sizes, without knowing the memory access patterns of kernels. The second contribution iteratively refines those buffer sizes through cosimulation. Moreover, the paper introduces an open-source framework using these methods to facilitate dataflow programming on FPGA using HLS. The proposed methods and framework have been tested on 7 dataflow applications, and outperform Vitis HLS cosimulation in 5 benchmarks, either in terms of BRAM and LUT usage, or in terms of exploration time. In the 2 other benchmarks, our best method gets results similar to Vitis HLS. Last but not least, our method admits directed cycles in the application graphs.


Citations: 0

Programmable Analog System Benchmarks leading to Efficient Analog Computation Synthesis

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-29 DOI: 10.1145/3625298

Jennifer Hasler, Cong Hao

This effort develops the first rich suite of analog & mixed-signal benchmarks of various sizes and domains, intended for use with contemporary analog and mixed-signal design and synthesis tools. Benchmarking enables analog-digital co-design exploration as well as extensive evaluation of analog synthesis tools and the generated analog/mixed-signal circuit or device. The goals of this effort are defining analog computation system benchmarks, developing the required concepts for higher-level analog & mixed-signal tools to utilize these benchmarks, and enabling future automated architectural design space exploration (DSE) to determine the best configurable architecture (e.g., a new FPAA) for a certain family of applications. The benchmarks comprise multiple levels of an acoustic, a vision, a communications, and an analog filter system that must be simultaneously satisfied for a complete system.


Citations: 0

Tailor: Altering Skip Connections for Resource-Efficient Inference

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-22 DOI: 10.1145/3624990

Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Abarajithan G, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner

Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network’s skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware-efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network’s skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.


Citations: 2

High-Efficiency TRNG Design based on Multi-bit Dual-ring Oscillator

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-21 DOI: 10.1145/3624991

Yingchun Lu, Yun Yang, Rong Hu, Huaguo Liang, Maoxiang Yi, Huang Zhengfeng, Yuanming Ma, Tian Chen, Liang Yao

Unpredictable true random numbers are required in security applications such as information encryption, key generation, mask generation to counter side-channel analysis, and algorithm initialization. Existing true random number generators (TRNGs) produce bits too slowly to supply random bits at high rates, so a faster TRNG is needed. This work presents an ultra-compact, high-throughput TRNG based on a novel extendable dual-ring oscillator (DRO). Because the DRO outputs multiple bits per cycle, all of which can be used to form the raw random sequence, the proposed DRO achieves maximum resource utilization and yields a more efficient TRNG than a conventional ring oscillator (RO) based TRNG, which has only a single output and must build multiple groups of ring oscillators. TRNGs based on the 2-bit DRO and its 8-bit derivative structure have been verified on Xilinx Artix-7 and Kintex-7 FPGAs under automatic placement and routing, achieving throughputs of 550 Mbps and 1100 Mbps, respectively. The proposed scheme also has clear advantages in throughput relative to operating frequency, hardware consumption, and entropy. Finally, the generated sequences show good randomness under the NIST SP800-22 and Dieharder test suites and pass the NIST SP800-90B and AIS-31 entropy estimation tests.


Citations: 0

TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-Optimization of HLS and Physical Design

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-18 DOI: 10.1145/3609335

Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong

In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge.


Citations: 6

Montgomery Multiplication Scalable Systolic Designs Optimized for DSP48E2

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-15 DOI: 10.1145/3624571

Louis Noyez, Nadia El Mrabet, Olivier Potin, Pascal Veron

This paper describes an extensive study of the use of DSP48E2 Slices in Ultrascale FPGAs to design hardware versions of the Montgomery Multiplication algorithm for the hardware acceleration of modular multiplications. Our fully scalable systolic architectures result in parallelized, DSP48E2-optimized scheduling of operations analogous to the FIOS block variant of the Montgomery Multiplication. We explore the impacts of different pipelining strategies within DSP blocks, scheduling of operations, processing element configurations, global design structures and their trade-offs in terms of performance and resource costs. We discuss the application of our methodology to multiple types of DSP primitives. We provide ready-to-use, fast, efficient and fully parametrizable designs which can adapt to a wide range of requirements and applications. Implementations are scalable to any operand width. Our most efficient designs can perform 128-, 256-, 512-, 1024-, 2048- and 4096-bit Montgomery modular multiplications in 0.0992 μs, 0.2032 μs, 0.3952 μs, 0.7792 μs, 1.550 μs and 3.099 μs using 4, 6, 11, 21, 41 and 82 DSP blocks respectively.
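For readers unfamiliar with the underlying arithmetic, the following is a minimal sketch of textbook Montgomery multiplication (the REDC reduction) on whole Python integers. It is not the word-level FIOS scheduling or the systolic DSP48E2 design the abstract describes; the function names and the choice of `k` are illustrative assumptions. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(n: int, k: int):
    """Precompute constants for an odd modulus n with R = 2**k > n."""
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime = -1 (mod R)
    r2 = (r * r) % n                # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def mont_mul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Return a * b * R^-1 mod n for 0 <= a, b < n, using only shifts and masks."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> k               # exact division by R
    return u - n if u >= n else u      # single conditional subtraction


def mod_mul(a: int, b: int, n: int, k: int) -> int:
    """Compute (a * b) mod n by converting to and from Montgomery form."""
    n_prime, r2 = montgomery_setup(n, k)
    a_m = mont_mul(a % n, r2, n, k, n_prime)  # a * R mod n
    b_m = mont_mul(b % n, r2, n, k, n_prime)  # b * R mod n
    c_m = mont_mul(a_m, b_m, n, k, n_prime)   # a * b * R mod n
    return mont_mul(c_m, 1, n, k, n_prime)    # leave Montgomery form
```

The appeal for hardware is that the reduction replaces division by `n` with multiplications, a mask, and a shift by `k` bits. Converting into and out of Montgomery form costs extra multiplications, so the method pays off when many multiplications share one modulus, as in modular exponentiation.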


Citations: 0

ExHiPR: Extended High-level Partial Reconfiguration for Fast Incremental FPGA Compilation

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-14 DOI: 10.1145/3617837

Yuanlong Xiao, Dongjoon Park, Zeyu Jason Niu, Aditya Hota, André DeHon

Partial Reconfiguration (PR) is a key technique in application design on modern FPGAs. However, current PR tools rely heavily on the developer to manually conduct PR module definition, floorplanning, and flow control at a low level. Existing PR tools also do not consider High-Level Synthesis (HLS) languages, which are of great interest to software developers. We propose HiPR, an open-source framework, to bridge the gap between HLS and PR. HiPR allows the developer to define partially reconfigurable C/C++ functions, instead of Verilog modules, to accelerate FPGA incremental compilation and automate the flow from C/C++ to bitstreams. We use a lightweight Simulated Annealing floorplanner and show that it can produce high-quality PR floorplans an order of magnitude faster than analytic methods. By mapping Rosetta HLS benchmarks, we demonstrate that the incremental compilation can be accelerated by 3–10× compared with the state-of-the-art Xilinx Vitis flow without performance loss, at the cost of 15–67% one-time overlay set-up time.


Citations: 0

The Open-Source DeLiBA2 Hardware/Software Framework for Distributed Storage Accelerators

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-14 DOI: 10.1145/3624482

Babar Khan, Carsten Heinz, Andreas Koch

With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been several research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge there have been only very limited efforts to move larger parts of the Linux block I/O stack into FPGA-based hardware accelerators. Our hardware/software framework DeLiBA initially addressed this deficiency by allowing high-productivity development of software components of the I/O stack in user instead of kernel space and leverages a proven FPGA SoC framework to quickly compose and deploy the actual FPGA-based I/O accelerators. In its initial form, it achieves 10% higher throughput and up to 2.3× the I/Os per second (IOPS) for a proof-of-concept Ceph accelerator running in a real multi-node Ceph cluster. In DeLiBA2, we have extended the framework further to better support distributed storage systems, specifically by directly integrating the block I/O accelerators with a hardware-accelerated network stack, as well as by accelerating more storage functions. With these improvements, performance grows significantly: The cluster-level speed-ups now reach up to 2.8× for both throughput and IOPS relative to Ceph in software in synthetic benchmarks, and achieve end-to-end wall clock speed-ups of 20% for the real workload of building a large software package.


Citations: 0

GraphScale: Scalable Processing on FPGAs for HBM and Large Graphs

Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-09-13 DOI: 10.1145/3616497

Jonas Dann, Daniel Ritter, Holger Fröning

Recent advances in graph processing on FPGAs promise to alleviate performance bottlenecks with irregular memory access patterns. Such bottlenecks challenge performance for a growing number of important application areas like machine learning and data analytics. While FPGAs denote a promising solution through flexible memory hierarchies and massive parallelism, we argue that current graph processing accelerators either use the off-chip memory bandwidth inefficiently or do not scale well across memory channels. In this work, we propose GraphScale, a scalable graph processing framework for FPGAs. GraphScale combines multi-channel memory with asynchronous graph processing (i.e., for fast convergence on results) and a compressed graph representation (i.e., for efficient usage of memory bandwidth and reduced memory footprint). GraphScale solves common graph problems like breadth-first search, PageRank, and weakly-connected components through modular user-defined functions, a novel two-dimensional partitioning scheme, and a high-performance two-level crossbar design. Additionally, we extend GraphScale to scale to modern high-bandwidth memory (HBM) and reduce partitioning overhead of large graphs with binary packing.
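One of the graph problems named in the abstract, weakly-connected components, can be sketched in software as in-place label propagation. This is a minimal illustration under stated assumptions: "asynchronous" is interpreted here as label updates becoming visible to later edges within the same pass, and the edge-list input and function name are hypothetical. It does not reproduce GraphScale's partitioning, crossbar, or compressed representation.

```python
def weakly_connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    Edges are treated as undirected. Labels are updated in place, so a
    lowered label is seen by later edges in the same pass rather than
    only in the next pass; this usually reduces the number of passes.
    """
    labels = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low:
                labels[u] = low
                changed = True
            if labels[v] != low:
                labels[v] = low
                changed = True
    return labels
```

For example, with six vertices and edges `[(0, 1), (1, 2), (3, 4)]` the function returns `[0, 0, 0, 3, 3, 5]`: one component of three vertices, one of two, and one isolated vertex.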


Citations: 0
