ACM Transactions on Reconfigurable Technology and Systems: Latest Articles

Strega: An HTTP Server for FPGAs
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-10-10; DOI: 10.1145/3611312
Fabio Maschi, Gustavo Alonso
The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source, lightweight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.
Citations: 0
Reprogrammable non-linear circuits using ReRAM for NN accelerators
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-10-10; DOI: 10.1145/3617894
Rafael Fão de Moura, Luigi Carro
As Artificial Intelligence (AI) techniques spread throughout the economy, researchers are exploring new techniques to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM (ReRAM) devices to compute Matrix-Vector Multiplication (MVM) in O(1) time complexity is a promising approach, but these implementations often fail to cover the diversity of nonlinearities required for modern NN applications. In this work, we propose a novel approach where ReRAMs themselves can be reprogrammed to compute not only the required matrix multiplications, but also the activation functions, softmax, and pooling layers, reducing energy in complex NNs. This approach offers more versatility for researching novel NN layouts compared to custom logic. Results show that our device outperforms analog and digital Field Programmable approaches by up to 8.5x in experiments on real-world human activity recognition and language modeling datasets with Convolutional Neural Networks (CNNs), Generative Pre-trained Transformer (GPT), and Long Short-Term Memory (LSTM) models.
Citations: 0
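The O(1) matrix-vector multiplication mentioned in the ReRAM entry above follows from Ohm's and Kirchhoff's laws: each crossbar cell contributes a current equal to its input voltage times its conductance, and the currents on a column wire add up. The sketch below is a minimal software model of that idea, assuming an idealized crossbar with no wire resistance or device noise; the function name and the numeric values are illustrative and do not come from the paper. The software loop takes time proportional to the array size, whereas the analog array produces all column currents in a single step.

```python
# Idealized ReRAM crossbar: cell (i, j) is a conductance G[i][j] that connects
# input row i to output column j. All values here are illustrative only.

def crossbar_mvm(conductances, voltages):
    """Return column currents I[j] = sum_i V[i] * G[i][j]."""
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage per crossbar row is required")
    currents = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            # Ohm's law per cell; Kirchhoff's current law sums along the column.
            currents[j] += voltages[i] * conductances[i][j]
    return currents


G = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
V = [1.0, 0.5, 2.0]
I = crossbar_mvm(G, V)
print(I)
```

In this model the weights of a layer are stored as conductances, so the multiplication happens where the data resides; nonlinear layers such as softmax are outside the scope of this sketch.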
FDRA: A Framework for Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-10-10; DOI: 10.1145/3614224
Yunhui Qiu, Yiqing Mao, Xuchen Gao, Sichao Chen, Jiangnan Li, Wenbo Yin, Lingli Wang
Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open-source works often lack integration of CGRAs with CPU systems and corresponding toolchains. Moreover, support for accelerator instruction pipelining that overlaps data communication, computation, and configuration across multiple tasks is rare. In this paper, we propose FDRA, an open-source exploration framework for a heterogeneous system-on-chip (SoC) with a RISC-V processor and a dynamically reconfigurable accelerator (DRA) supporting loop, instruction, and task levels of parallelism. FDRA encompasses parameterized SoC modeling, Verilog generation, source-to-source application code transformation using frontend and DRA compilers, SoC simulation, and FPGA prototyping. FDRA incorporates the extraction of periodic accumulative operators and multidimensional linear load/store operators from nested loops. The DRA enables accessing the shared L2 cache with virtual addresses and supports direct memory access (DMA) with arbitrary start addresses and data lengths. Integrated into the RISC-V Rocket SoC, our DRA achieves a remarkable 55× acceleration for loop kernels and improves energy efficiency by 29×. Compared to state-of-the-art RISC-V vector units, our DRA demonstrates a 2.9× speed improvement and 3.5× greater energy efficiency. In contrast to previous CGRA+RISC-V SoCs, our SoC achieves a minimum speedup of 5.2×.
Citations: 0
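One plausible reading of the "multidimensional linear load/store operators" in the FDRA entry above is an affine address pattern: every level of a loop nest contributes its index times a fixed stride to the address. The sketch below illustrates that reading only; the function name, parameters, and example values are assumptions and are not taken from FDRA's compiler or hardware.

```python
from itertools import product


def linear_addresses(base, counts, strides):
    """Yield base + sum(i_k * strides[k]) for every index tuple of a loop nest.

    counts[k] is the trip count of loop level k (outermost first);
    strides[k] is the address step of that level, in elements.
    """
    if len(counts) != len(strides):
        raise ValueError("counts and strides must have the same length")
    for idx in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(idx, strides))


# Hypothetical example: read a 2x3 tile from a row-major array that has
# 8 columns, starting at element offset 10.
tile = list(linear_addresses(base=10, counts=(2, 3), strides=(8, 1)))
print(tile)
```

Describing an access by a base, trip counts, and strides lets a load/store unit generate the whole address stream on its own, which is what makes it possible to overlap data movement with computation.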
Constraint-Aware Multi-Technique Approximate High-Level Synthesis for FPGAs
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-10-09; DOI: 10.1145/3624481
Marcos T. Leipnitz, Gabriel L. Nazar
Numerous approximate computing (AC) techniques have been developed to reduce the design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, trading off computation accuracy for area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex but crucial for high-quality results. In this context, Approximate High-Level Synthesis (AHLS) tools have been proposed to alleviate the burden of hand-crafting approximate circuits by automating the exploitation of AC techniques. However, such tools are typically tied to a specific approximation technique or a difficult-to-extend set of techniques whose exploitation is not fully automated or steered by optimization targets. Therefore, available AHLS tools overlook the benefits of expanding the design space by mixing diverse approximation techniques toward meeting specific design objectives with minimum error. In this work, we propose an AHLS design methodology for FPGAs that automatically identifies efficient combinations of multiple approximation techniques for different applications and design constraints. Compared to single-technique approaches, decreases of up to 30% in mean squared error and absolute increases of up to 6.5% in percentage accuracy were obtained for a set of image, video, signal processing and machine learning benchmarks.
Citations: 0
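To make the accuracy metric in the entry above concrete, the sketch below applies one simple approximation, dropping the low-order bits of both operands before an addition, and measures the resulting mean squared error against the exact sums. Operand truncation is chosen here purely as an illustration; the abstract does not say which techniques the proposed methodology combines, and the function names and input values are assumptions.

```python
def truncated_add(a, b, dropped_bits):
    """Add two non-negative ints after zeroing their low `dropped_bits` bits."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) + (b & mask)


def mean_squared_error(exact, approx):
    """Average of squared differences between two equal-length sequences."""
    if len(exact) != len(approx):
        raise ValueError("sequences must have the same length")
    return sum((e - a) ** 2 for e, a in zip(exact, approx)) / len(exact)


pairs = [(13, 7), (200, 55), (31, 32), (4, 4)]
exact = [a + b for a, b in pairs]
approx = [truncated_add(a, b, dropped_bits=2) for a, b in pairs]
mse = mean_squared_error(exact, approx)
print(approx, mse)
```

Dropping more bits shrinks the adder but raises the error, which is the kind of trade-off an AHLS tool has to explore under a given design constraint.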
A Partitioned CAM Architecture with FPGA Acceleration for Binary Descriptor Matching
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-10-05; DOI: 10.1145/3624749
Parastoo Soleimani, David W. Capson, Kin Fun Li
An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture to support approximate content matching for selecting image matches with local binary descriptors. Matches are based on Hamming distances computed for all possible pairs of binary descriptors extracted from two images. We demonstrate an FPGA-based implementation of our CAM-based descriptor matching unit to illustrate the high matching speed of our design. The time complexity of our modified CAM method for binary descriptor matching is O(n). Our method performs binary descriptor matching at a rate of one descriptor per clock cycle at a frequency of 102 MHz. The resource utilization and timing metrics of several experiments are reported to demonstrate the efficacy and scalability of our design.
Citations: 0
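The matching criterion in the entry above, Hamming distance between binary descriptors, can be sketched in software as an XOR followed by a bit count. The code below is a sequential reference model under assumed names and toy 8-bit descriptors; it is not the partitioned CAM design, which compares a query against stored entries in hardware rather than in a loop.

```python
def hamming_distance(a, b):
    """Number of differing bits between two descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(queries, references, max_distance):
    """For each query, report (query_index, reference_index, distance) for the
    closest reference, provided it lies within max_distance bits.

    `references` must be non-empty; ties go to the lowest reference index.
    """
    matches = []
    for qi, q in enumerate(queries):
        best_ri, best_d = min(
            ((ri, hamming_distance(q, r)) for ri, r in enumerate(references)),
            key=lambda pair: pair[1],
        )
        if best_d <= max_distance:
            matches.append((qi, best_ri, best_d))
    return matches


# Toy 8-bit descriptors standing in for descriptors extracted from two images.
image_a = [0b10110010, 0b00001111, 0b11111111]
image_b = [0b10110011, 0b11110000, 0b00001011]
result = match_descriptors(image_a, image_b, max_distance=2)
print(result)
```

The threshold is what makes the match approximate: a conventional CAM would only report entries at distance zero.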
Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-09-29; DOI: 10.1145/3626103
Alexandre Honorat, Mickaël Dardaillon, Hugo Miomandre, Jean-François Nezan
High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This paper proposes an original method to safely approximate such buffer sizes. The first contribution computes an initial overestimation of buffer sizes, without knowing the memory access patterns of kernels. The second contribution iteratively refines those buffer sizes thanks to cosimulation. Moreover, the paper introduces an open source framework using these methods to facilitate dataflow programming on FPGA using HLS. The proposed methods and framework have been tested on 7 dataflow applications, and outperform Vitis HLS cosimulation in 5 benchmarks, either in terms of BRAM and LUT usage, or in terms of exploration time. In the 2 other benchmarks, our best method gets results similar to Vitis HLS. Last but not least, our method admits directed cycles in the application graphs.
Citations: 0
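The buffer-sizing problem in the entry above can be pictured with a toy model: given the cycles at which a producer kernel writes tokens and a consumer kernel reads them, the FIFO between them must be at least as deep as its peak occupancy. The sketch below computes that peak from two fixed schedules. It is a stand-in for observing occupancy during simulation, not the paper's estimation or refinement method, and the function name and schedules are assumptions.

```python
def peak_fifo_occupancy(write_cycles, read_cycles):
    """Smallest FIFO depth that never overflows for the given schedules.

    write_cycles / read_cycles list the cycle at which each token is written
    or read. Within a cycle, reads are applied before writes, so a token
    written in cycle t can be read in cycle t + 1 at the earliest.
    """
    # Tag reads with 0 and writes with 1 so that sorting puts reads first
    # whenever both fall in the same cycle.
    events = [(t, 0) for t in read_cycles] + [(t, 1) for t in write_cycles]
    occupancy = 0
    peak = 0
    for _, is_write in sorted(events):
        if is_write:
            occupancy += 1
            peak = max(peak, occupancy)
        else:
            if occupancy == 0:
                raise ValueError("read from an empty FIFO: infeasible schedule")
            occupancy -= 1
    return peak


# The producer bursts four tokens in cycles 0..3; the consumer drains one
# token every second cycle starting at cycle 2.
depth = peak_fifo_occupancy(write_cycles=[0, 1, 2, 3], read_cycles=[2, 4, 6, 8])
print(depth)
```

A buffer smaller than this depth would stall the producer, while a larger one wastes BRAM or LUT resources, which is why a safe overestimate followed by refinement is attractive.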
Programmable Analog System Benchmarks leading to Efficient Analog Computation Synthesis
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-09-29; DOI: 10.1145/3625298
Jennifer Hasler, Cong Hao
This effort develops the first rich suite of analog and mixed-signal benchmarks of various sizes and domains, intended for use with contemporary analog and mixed-signal designs and synthesis tools. Benchmarking enables analog-digital co-design exploration as well as extensive evaluation of analog synthesis tools and the generated analog/mixed-signal circuit or device. The goals of this effort are defining analog computation system benchmarks, developing the required concepts for higher-level analog and mixed-signal tools to utilize these benchmarks, and enabling future automated architectural design space exploration (DSE) to determine the best configurable architecture (e.g., a new FPAA) for a certain family of applications. The benchmarks comprise multiple levels of an acoustic, a vision, a communications, and an analog filter system that must be simultaneously satisfied for a complete system.
Citations: 0
Tailor: Altering Skip Connections for Resource-Efficient Inference
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-09-22; DOI: 10.1145/3624990
Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Abarajithan G, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner
Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network’s skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware-efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network’s skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.
Citations: 2
High-Efficiency TRNG Design based on Multi-bit Dual-ring Oscillator
Zone 4, Computer Science; Q1 Computer Science; Pub Date: 2023-09-21; DOI: 10.1145/3624991
Yingchun Lu, Yun Yang, Rong Hu, Huaguo Liang, Maoxiang Yi, Huang Zhengfeng, Yuanming Ma, Tian Chen, Liang Yao
Unpredictable true random numbers are required in security technology fields such as information encryption, key generation, mask generation for anti-side-channel analysis, algorithm initialization, etc. At present, true random number generators (TRNGs) generate bits too slowly to supply random bits at high rates. Therefore, it is necessary to design a faster TRNG. This work presents an ultra-compact TRNG with high throughput based on a novel extendable dual-ring oscillator (DRO). Because the multiple bits output per cycle by the DRO can all be used to obtain the original random sequence, the proposed DRO maximizes resource utilization and yields a more efficient TRNG than the conventional TRNG system based on a ring oscillator (RO), which has only a single output and needs multiple groups of ring oscillators. The TRNG based on the 2-bit DRO and its 8-bit derivative structure has been verified on Xilinx Artix-7 and Kintex-7 FPGAs under automatic placement and routing, achieving throughputs of 550 Mbps and 1100 Mbps, respectively. Moreover, in terms of throughput performance over operating frequency, hardware consumption, and entropy, the proposed scheme has obvious advantages. Finally, the generated sequences show good randomness in the NIST SP800-22 and Dieharder test suites and pass the entropy estimation test kits NIST SP800-90B and AIS-31.
Citations: 0
TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-Optimization of HLS and Physical Design
Zone 4 Computer Science Q1 Computer Science Pub Date: 2023-09-18 DOI: 10.1145/3609335
Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong
In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge .
Citations: 6
Journal
ACM Transactions on Reconfigurable Technology and Systems