ACM Transactions on Reconfigurable Technology and Systems: Latest Articles, Page 9

BLOOP: Boolean Satisfiability-based Optimized Loop Pipelining

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-30 DOI: https://dl.acm.org/doi/10.1145/3599972

Nicolai Fiege, Peter Zipf

Modulo scheduling is the premier technique for throughput maximization of loops in high-level synthesis by interleaving consecutive loop iterations. The number of clock cycles between data insertions is called initiation interval (II). For throughput maximization, this value should be as low as possible; therefore its minimization is the main optimization goal.

Despite its long history, modulo scheduling has remained a relevant research topic in recent years, with many exact and heuristic algorithms available in the literature.

Nevertheless, we are able to leverage the scalability of modern Boolean Satisfiability (SAT) solvers to outperform state-of-the-art ILP-based algorithms for latency-optimal modulo scheduling for both integer and rational IIs. Our algorithm is able to compute valid modulo schedules for the whole CHStone and MachSuite benchmark suites, with 99% of the solutions being proven to be throughput-optimal for a timeout of only 10 min per candidate II. For various time limits, not a single tested scheduler from the state-of-the-art is able to compute more verified optimal solutions or even a single schedule with a higher throughput than our proposed approach. Using an HLS toolflow we show that our algorithm can be effectively used to generate Pareto-optimal FPGA implementations regarding throughput and resource usage.
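To make the role of the initiation interval concrete, the following minimal Python sketch checks whether a given schedule is a valid modulo schedule for a candidate II. It is only an illustration of the standard dependency and modulo-resource constraints, not the SAT formulation of the paper. The function name, the data layout, and the assumption that every operation occupies its resource for exactly one cycle are all hypothetical choices made for this example.

```python
from collections import Counter


def is_valid_modulo_schedule(start, latency, deps, resource, limits, ii):
    """Check a schedule against a candidate initiation interval `ii`.

    start:    op -> start cycle within one iteration
    latency:  op -> latency in cycles
    deps:     list of (src, dst, distance); distance = iterations between src and dst
    resource: op -> resource type (only resource-limited ops are listed)
    limits:   resource type -> number of available units
    Assumption: each op blocks its unit for a single cycle (fully pipelined units).
    """
    # Dependency constraint: dst (shifted by distance * ii) must not start
    # before src has finished.
    for src, dst, distance in deps:
        if start[dst] + distance * ii < start[src] + latency[src]:
            return False

    # Resource constraint: ops sharing a resource type and a congruence class
    # (start cycle modulo ii) compete for the same units in steady state.
    usage = Counter()
    for op, res in resource.items():
        usage[(res, start[op] % ii)] += 1
    return all(count <= limits[res] for (res, _slot), count in usage.items())


# Hypothetical three-operation loop body with a single memory port.
start = {"load": 0, "mul": 2, "store": 5}
latency = {"load": 2, "mul": 3, "store": 1}
deps = [("load", "mul", 0), ("mul", "store", 0)]
resource = {"load": "mem", "store": "mem"}
limits = {"mem": 1}

print(is_valid_modulo_schedule(start, latency, deps, resource, limits, ii=1))
print(is_valid_modulo_schedule(start, latency, deps, resource, limits, ii=2))
```

In this made-up example the load and the store collide on the single memory port when II = 1, while II = 2 places them in different modulo slots. A scheduler searching for the minimum II would try candidate values in increasing order and look for start times that pass such a check.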


Citations: 0

Increasing the Robustness of TERO-TRNGs against Process Variation

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-23 DOI: https://dl.acm.org/doi/10.1145/3597418

Christian Skubich, Peter Reichel, Marc Reichenbach

The Transition Effect Ring Oscillator (TERO) is a popular design for building entropy sources because it is compact, built from digital elements only, and very well suited for FPGAs. However, it is known to be very sensitive to process variation. While this sensitivity is useful for building Physical Unclonable Functions, it interferes with the use of the TERO as an entropy source.

In this paper, we investigate an approach to increase reliability. We show that adding a third stage eliminates much of the susceptibility to process variation and how a resulting GHz oscillation can be evaluated on an FPGA. The design is supported by physical and stochastic modeling. The physical model is validated using an experiment with dynamically reconfigurable LUTs.

Citations: 0

Increasing the Robustness of TERO-TRNGs Against Process Variation

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-23 DOI: 10.1145/3597418

Christian Skubich, Peter Reichel, M. Reichenbach

The transition effect ring oscillator is a popular design for building entropy sources because it is compact, built from digital elements only, and very well suited for FPGAs. However, it is known to be quite sensitive to process variation. Although this sensitivity is useful for building physical unclonable functions, it interferes with the oscillator's use as an entropy source. In this article, we investigate an approach to increase reliability. We show that adding a third stage eliminates much of the susceptibility to process variation and how a resulting gigahertz oscillation can be evaluated on an FPGA. The design is supported by physical and stochastic modeling. The physical model is validated using an experiment with dynamically reconfigurable look-up tables.

Citations: 0

An FPGA Accelerator for Genome Variant Calling

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-22 DOI: https://dl.acm.org/doi/10.1145/3595297

Tiancheng Xu, Scott Rixner, Alan L. Cox

In genome analysis, it is often important to identify variants from a reference genome. However, identifying variants that occur with low frequency can be challenging, as it is computationally intensive to do so accurately. LoFreq is a widely used program that is adept at identifying low-frequency variants. This paper presents a design framework for an FPGA-based accelerator for LoFreq. In particular, this accelerator is targeted at virus analysis, which is more challenging than human genome analysis because the characteristics of the data to be analyzed are fundamentally different. Across the design space, this accelerator achieves speedups of up to 120× on the core computation of LoFreq and up to 51.7× across the entire program.
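The gap between a core-computation speedup and a whole-program speedup is what Amdahl's law describes. The short Python sketch below shows that relationship only as general background. The function name and the example fraction are hypothetical, and the abstract does not state what share of LoFreq's runtime the core computation represents or whether both "up to" figures come from the same configuration.

```python
def overall_speedup(core_fraction: float, core_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only a fraction of the
    original runtime (core_fraction, between 0 and 1) is accelerated
    by a factor of core_speedup."""
    return 1.0 / ((1.0 - core_fraction) + core_fraction / core_speedup)


# Hypothetical input: if 99% of the runtime were spent in the accelerated
# core and that core ran 120x faster, the remaining 1% would dominate.
print(overall_speedup(0.99, 120.0))
```

The unaccelerated remainder bounds the whole-program gain, which is why an end-to-end speedup is generally smaller than the speedup of the accelerated kernel.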

Citations: 0

Exploring FPGA Switch-Blocks without Explicit Pattern Listing

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-17 DOI: 10.1145/3597417

Stefan Nikolic, P. Ienne

Increased lower metal resistance makes physical aspects of Field-Programmable Gate Array (FPGA) switch-blocks more relevant than before. The need to navigate a design space where each individual switch can have significant impact on the FPGA’s performance in turn makes automated switch-pattern exploration techniques increasingly appealing. However, most existing exploration techniques have a fundamental limitation—they use the CAD tools as a black box to evaluate the performance of explicitly listed switch-patterns. Given the time needed to route a modern circuit on a single architecture, the number of switch-patterns that can be explicitly tested quickly becomes negligible compared to the size of the design space. This paper presents a technique that removes this fundamental limitation by making the entire design space visible to the router and letting it choose the switches to be added to the pattern, based on the requirements of the circuits being routed. The key to preventing the router from selecting arbitrary switches that would render the final pattern excessively large is to apply the same negotiation principle used by the router to remove congestion, just in the opposite direction, to make the signals reach a consensus on which switches are worthy of being included in the final switch-pattern.


Citations: 0

An Empirical Approach to Enhance Performance for Scalable CORDIC-Based Deep Neural Networks

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-05-08 DOI: 10.1145/3596220

Gopal R. Raut, Saurabh Karkun, S. Vishvakarma

Practical implementation of deep neural networks (DNNs) demands significant hardware resources, including high computational power and memory bandwidth. While existing field-programmable gate array (FPGA)-based DNN accelerators are primarily optimized for fast single-task performance, cost, energy efficiency, and overall throughput are also crucial considerations for their practical use in various applications. This article proposes a performance-centric pipelined Coordinate Rotation Digital Computer (CORDIC)-based MAC unit and implements a scalable CORDIC-based DNN architecture that is area- and power-efficient and has high throughput. The CORDIC-based neuron engine uses bit-rounding to maintain input-output precision with minimal hardware resource overhead. The results demonstrate the versatility of the proposed pipelined MAC, which operates at 460 MHz and allows for higher network throughput. A software-based implementation platform evaluates the proposed MAC operation's accuracy for more extensive neural networks and complex datasets. A DNN accelerator with a parameterized, modular, layer-multiplexed architecture is designed. Empirical evaluation through Pareto analysis is used to improve the efficiency of DNN implementations by fixing the arithmetic precision and optimal pipeline stages. The proposed architecture utilizes layer-multiplexing, a technique that effectively reuses a single DNN layer to enhance efficiency while maintaining modularity and adaptability for integrating various network configurations. The proposed CORDIC MAC-based DNN architecture is scalable for any bit-precision network size, and the DNN accelerator is prototyped using the Xilinx Virtex-7 VC707 FPGA board, operating at 66 MHz. The proposed design does not use any Xilinx macros, making it easily adaptable for ASIC implementation. Compared with state-of-the-art designs, the proposed design reduces resource use by 45% and power consumption by 4× without sacrificing performance. The accelerator is validated using the MNIST dataset, achieving 95.06% accuracy, only 0.35% less than other cutting-edge implementations.
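As background on how a CORDIC unit can perform a multiply-accumulate, the following Python sketch implements the textbook linear-mode CORDIC iteration, which builds a product out of shift-and-add steps. It is a floating-point illustration under stated assumptions, not the pipelined fixed-point design of the article. The function name, the iteration count, and the requirement that the weight magnitude stay below 2 are assumptions of this example.

```python
def cordic_mac(x: float, weight: float, acc: float = 0.0, iterations: int = 16) -> float:
    """Approximate acc + x * weight with linear-mode CORDIC.

    Each step adds or subtracts x * 2**-i (a right shift in hardware) while
    driving the residual `z` toward zero. Converges for |weight| < 2 because
    the step sizes 1, 1/2, 1/4, ... sum to just under 2.
    """
    y = acc
    z = weight
    for i in range(iterations):
        step = 2.0 ** -i
        direction = 1.0 if z >= 0 else -1.0
        y += direction * x * step   # accumulate a shifted copy of x
        z -= direction * step       # consume the matching part of the weight
    return y


# Example with made-up operands: 0.25 + 0.75 * 0.5
print(cordic_mac(0.75, 0.5, acc=0.25))
```

The result differs from the exact product by at most |x| times the final residual, so more iterations (or pipeline stages) trade latency and area for precision, which is the kind of tradeoff a Pareto analysis of precision and pipeline depth would explore.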


Citations: 0

RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-04-26 DOI: 10.1145/3593025

Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong

FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physically integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end, when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7× reduction in compile time and up to a 1.3× increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time by up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. When testing on the Xilinx U280 FPGA, we observed a 5-7× compile-time reduction and a 1.3× frequency increase.


Citations: 1

NAPOLY: A Non-deterministic Automata Processor OverLaY

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-04-24 DOI: 10.1145/3593586

Rasha Karakchi, J. Bakos

Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.
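The terms state activation and pattern reporting used above can be illustrated with a plain software NFA simulation. The Python sketch below is a generic set-based simulator written for this listing, not NAPOLY's overlay or compiler. The function name, the transition-table layout, and the choice to keep start states permanently active (so that matches may begin at any input position) are assumptions of the example.

```python
def run_nfa(transitions, start_states, reporting_states, text):
    """Simulate an NFA over `text` and return (position, state) reports.

    transitions:      dict mapping (state, symbol) -> set of next states
    start_states:     states that are active at every input position
    reporting_states: states whose activation signals a pattern match
    """
    start = set(start_states)
    active = set(start)
    reports = []
    for position, symbol in enumerate(text):
        activated = set()
        for state in active:
            activated |= transitions.get((state, symbol), set())
        # Every activated reporting state produces one report for this symbol.
        reports.extend((position, s) for s in sorted(activated & set(reporting_states)))
        # Start states stay active so a match can begin anywhere in the input.
        active = activated | start
    return reports


# Hypothetical automaton matching "ab" (report state 2) or "ac" (report state 3).
transitions = {(0, "a"): {1}, (1, "b"): {2}, (1, "c"): {3}}
print(run_nfa(transitions, {0}, {2, 3}, "xabac"))
```

The size of the active set per input symbol corresponds to the state activation rate, and the length of the report list corresponds to the reporting rate. Both grow with the pattern set and the input, which is why they are the quantities that stress a hardware implementation.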


Citations: 0

A Reconfigurable Architecture for Real-time Event-based Multi-Object Tracking

IF 2.3 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-04-21 DOI: 10.1145/3593587

Yizhao Gao, Song Wang, Hayden Kwok-Hay So

Although advances in event-based machine vision algorithms have demonstrated unparalleled capabilities in performing some of the most demanding tasks, their implementations under stringent real-time and power constraints in edge systems remain a major challenge. In this work, a reconfigurable hardware-software architecture called REMOT, which performs real-time event-based multi-object tracking on FPGAs, is presented. REMOT performs vision tasks by defining a set of actions over attention units (AUs). These actions allow AUs to track an object candidate autonomously by adjusting its region of attention, and allow information gathered by each AU to be used for making algorithmic-level decisions. Taking advantage of this modular structure, algorithm-architecture codesign can be performed by implementing different parts of the algorithm in either hardware or software for different tradeoffs. Results show that REMOT can process 0.43–2.91 million events per second at 1.75–5.45 watts. Compared with the software baseline, our implementation achieves up to 44 times higher throughput and 35.4 times higher power efficiency. Migrating the Merge operation to hardware further reduces the worst-case latency to be 95 times shorter than the software baseline. By varying the AU configuration and operation, a reduction of 0.59–0.77mW per AU on the programmable logic has also been demonstrated.


Citations: 0

Stream Aggregation with Compressed Sliding Windows

IF 2.3 Zone 4 Computer Science Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Reconfigurable Technology and Systems

Pub Date : 2023-04-05 DOI: 10.1145/3590774

Prajith Ramakrishnan Geethakumari, I. Sourdis

High-performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. When the aggregation functions cannot be computed incrementally, incoming data must be stored in a sliding window during processing. Stream aggregation then has two primary steps: updating the window with new incoming values, and reading the window to feed the aggregation functions. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck because they put tremendous pressure on memory bandwidth and capacity. This article addresses the problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip handles a number of data and control dependency challenges in order to integrate a compressor into the stream aggregation pipeline and alleviate the memory pressure caused by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. As a result, StreamZip offers higher throughput as well as a larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms, offering both lossless and lossy compression for integers as well as floating-point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
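
As a rough illustration of the two steps named in the abstract (window update and window read), the following Python sketch keeps a fixed-size sliding window and evaluates a non-incremental aggregate over it. It is not StreamZip's dataflow design: the names `SlidingWindow`, `delta_encode`, and `delta_decode` are hypothetical, the median is only one example of a function that cannot be computed incrementally, and simple delta encoding stands in for the compression algorithms, which the abstract does not name.

```python
from collections import deque
from statistics import median


class SlidingWindow:
    """Fixed-size sliding window over a stream of numeric values (illustrative only)."""

    def __init__(self, size):
        # A bounded deque evicts the oldest value automatically when full.
        self.values = deque(maxlen=size)

    def update(self, value):
        """Step 1: insert a new incoming value into the window."""
        self.values.append(value)

    def aggregate(self, fn=median):
        """Step 2: read the whole window and feed it to the aggregation function."""
        return fn(self.values)


def delta_encode(values):
    """Toy lossless compression: keep the first value, then successive differences."""
    values = list(values)
    if not values:
        return []
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

In this sketch every call to `aggregate` rereads the full window, which is the memory-bandwidth cost the abstract identifies as the bottleneck; storing the window in a compressed form (here, small deltas instead of full values) is one way to reduce the amount of data that has to be kept and moved.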


Volume: 16 1 Pages: 1 - 28 Publication type: Journal Article

Citations: 0