Pub Date : 2023-06-20 | DOI: https://dl.acm.org/doi/10.1145/3590774
Prajith Ramakrishnan Geethakumari, Ioannis Sourdis
High-performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window during processing when the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms, offering both lossless and lossy compression for integers as well as floating-point numbers. Compared to designs without compression, the StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
{"title":"Stream Aggregation with Compressed Sliding Windows","authors":"Prajith Ramakrishnan Geethakumari, Ioannis Sourdis","doi":"https://dl.acm.org/doi/10.1145/3590774","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3590774","url":null,"abstract":"<p>High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a <i>sliding window</i> during processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating-point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"26 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138543671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fibonacci and Galois are two different kinds of configurations in stream ciphers. Although many transformations between the two configurations have been proposed, there is no sufficient analysis of their FPGA performance. The Espresso stream cipher provides an ideal case for exploring this problem: the 128-bit secret-key Espresso cipher is designed in the Galois configuration, and a Fibonacci-configured Espresso variant has been proven to offer an equivalent security level. To fully leverage the efficiency of the two configurations, we explore hardware optimization approaches targeting area and throughput, respectively. In short, the FPGA-implemented Fibonacci cipher is more suitable for extremely resource-constrained or high-throughput applications, while the Galois cipher offers a compromise between area and speed. To the best of our knowledge, this is the first work to systematically compare the FPGA performance of the two cipher configurations under comparable cryptographic security. We hope this work can serve as a reference for the cryptographic hardware architecture research community.
{"title":"Design Space Exploration of Galois and Fibonacci Configuration Based on Espresso Stream Cipher","authors":"Zhengyuan Shi, Cheng Chen, Gangqiang Yang, Hailiang Xiong, Fudong Li, Honggang Hu, Zhiguo Wan","doi":"https://dl.acm.org/doi/10.1145/3567428","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3567428","url":null,"abstract":"<p>Fibonacci and Galois are two different kinds of configurations in stream ciphers. Although many transformations between two configurations have been proposed, there is no sufficient analysis of their FPGA performance. Espresso stream cipher provides an ideal sample to explore such a problem. The 128-bit secret key Espresso is designed in Galois configuration, and there is a Fibonacci-configured Espresso variant proved with the equivalent security level. To fully leverage the efficiency of two configurations, we explore the hardware optimization approaches toward area and throughput, respectively. In short, the FPGA-implemented Fibonacci cipher is more suitable for extremely resource-constrained or high-throughput applications, while the Galois cipher compromises both area and speed. To the best of our knowledge, this is the first work to systematically compare the FPGA performance of cipher configurations under relatively fair cryptographic security. We hope this work can serve as a reference for the cryptography hardware architecture research community.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"30 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Practical implementation of deep neural networks (DNNs) demands significant hardware resources, necessitating high computational power and memory bandwidth. While existing field-programmable gate array (FPGA)–based DNN accelerators are primarily optimized for fast single-task performance, cost, energy efficiency, and overall throughput are crucial considerations for their practical use in various applications. This article proposes a performance-centric pipelined Coordinate Rotation Digital Computer (CORDIC)–based MAC unit and implements a scalable CORDIC-based DNN architecture that is area- and power-efficient and has high throughput. The CORDIC-based neuron engine uses bit-rounding to maintain input-output precision with minimal hardware resource overhead. The results demonstrate the versatility of the proposed pipelined MAC, which operates at 460 MHz and allows for higher network throughput. A software-based implementation platform evaluates the accuracy of the proposed MAC operation for larger neural networks and more complex datasets. A DNN accelerator with a parameterized, modular, layer-multiplexed architecture is designed. Empirical evaluation through Pareto analysis is used to improve the efficiency of DNN implementations by fixing the arithmetic precision and the optimal number of pipeline stages. The proposed architecture utilizes layer-multiplexing, a technique that effectively reuses a single DNN layer to enhance efficiency while maintaining modularity and adaptability for integrating various network configurations. The proposed CORDIC MAC-based DNN architecture is scalable to any bit precision and network size, and the DNN accelerator is prototyped on the Xilinx Virtex-7 VC707 FPGA board, operating at 66 MHz. The proposed design does not use any Xilinx macros, making it easily adaptable for ASIC implementation. Compared with state-of-the-art designs, the proposed design reduces resource use by 45% and power consumption by 4× without sacrificing performance. The accelerator is validated using the MNIST dataset, achieving 95.06% accuracy, only 0.35% less than other cutting-edge implementations.
{"title":"An Empirical Approach to Enhance Performance for Scalable CORDIC-Based Deep Neural Networks","authors":"Gopal Raut, Saurabh Karkun, Santosh Kumar Vishvakarma","doi":"https://dl.acm.org/doi/10.1145/3596220","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3596220","url":null,"abstract":"<p>Practical implementation of deep neural networks (DNNs) demands significant hardware resources, necessitating high computational power and memory bandwidth. While existing field-programmable gate array (FPGA)–based DNN accelerators are primarily optimized for fast single-task performance, cost, energy efficiency, and overall throughput are crucial considerations for their practical use in various applications. This article proposes a performance-centric pipeline Coordinate Rotation Digital Computer (CORDIC)–based MAC unit and implements a scalable CORDIC-based DNN architecture that is area- and power-efficient and has high throughput. The CORDIC-based neuron engine uses bit-rounding to maintain input-output precision and minimal hardware resource overhead. The results demonstrate the versatility of the proposed pipelined MAC, which operates at 460 MHz and allows for higher network throughput. A software-based implementation platform evaluates the proposed MAC operation’s accuracy for more extensive neural networks and complex datasets. The DNN accelerator with parameterized and modular layer-multiplexed architecture is designed. Empirical evaluation through Pareto analysis is used to improve the efficiency of DNN implementations by fixing the arithmetic precision and optimal pipeline stages. The proposed architecture utilizes layer-multiplexing, a technique that effectively reuses a single DNN layer to enhance efficiency while maintaining modularity and adaptability for integrating various network configurations. The proposed CORDIC MAC-based DNN architecture is scalable for any bit-precision network size, and the DNN accelerator is prototyped using the Xilinx Virtex-7 VC707 FPGA board, operating at 66 MHz. The proposed design does not use any Xilinx macros, making it easily adaptable for ASIC implementation. Compared with state-of-the-art designs, the proposed design reduces resource use by 45% and power consumption by 4× without sacrificing performance. The accelerator is validated using the MNIST dataset, achieving 95.06% accuracy, only 0.35% less than other cutting-edge implementations.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"88 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The International Conference on Field-Programmable Technology (FPT) is widely considered to be the premier conference series on reconfigurable technology in the Asia-Pacific region. In 2021, the 20th event in the series was planned to be held on location in Auckland. However, the Covid-19 pandemic made a traditional in-person conference impossible, and imposed a purely virtual mode of presentation and discussion. Despite these difficulties, the topics of FPT remain as current as ever. Field-programmable devices such as FPGAs offer the advantages of dedicated hardware, e.g., in terms of performance or power efficiency, but with an almost software-like flexibility and ease of use. This makes them a highly interesting implementation alternative for domains where the performance or flexibility of off-the-shelf computing platforms such as CPUs or GPUs does not suffice, but the use of fully application-specific chips (ASICs) is not possible, e.g., due to their very high non-recurring costs and the extreme design effort required to target current silicon fabrication technologies. Reconfigurable technology encompasses a wide range of research topics that must be addressed to advance the field. These include tools and design techniques, architectures for field-programmable systems, and device technology for field-programmable chips. And, last but not least, a study of how the technology can be leveraged in practice to improve applications, turning the potential technology benefits into actual gains for the end users. With this wide range of topics, and despite the virtual conference mode, FPT 2021 attracted 129 submissions across its four tracks. After a thorough reviewing process, which included a rebuttal phase and at least three reviews for the research papers, 27 contributions could be accepted as full papers (21% acceptance rate), and 14 as short papers (32% overall acceptance rate). After the conference, we invited the best eight papers from the FPT conference to submit extended versions of their work to ACM TRETS. Four author groups accepted this invitation and provided new manuscripts that underwent the full TRETS review-and-revision process. Of the new manuscripts provided, three were revised sufficiently to achieve acceptance in time for inclusion into this special issue:
{"title":"Introduction to the Special Issue on FPT 2021","authors":"A. Koch, W. Zhang","doi":"10.1145/3603701","DOIUrl":"https://doi.org/10.1145/3603701","url":null,"abstract":"The International Conference on Field-Programmable Technology (FPT) is widely considered to be the premier conference series on reconfigurable technology in the Asia-Pacific region. In 2021, the 20 event in a series was planned to be held on-location in Auckland. However, the Covid-19 pandemic made a traditional in-presence conference impossible, and imposed a purely virtual mode of presentation and discussion. Despite these difficulties, the topics of FPT remain as current as ever. Field programmable devices such as FPGAs offer the advantages of dedicated hardware, e.g., in terms of performance or power efficiency, but with an almost software-like flexibility and ease-of-use. This makes them a highly interesting implementation alternative for domains where the performance or flexibility of off-theshelf computing platforms such as CPUs or GPUs do not suffice, but the use of fully applicationspecific chips (ASICs) is not possible, e.g., due to their very high non-recurring costs and the extreme design effort required to target current silicon fabrication technologies. Reconfigurable technology encompasses a wide range of research topics that must be addressed to advance the field. These include tools and design techniques, architectures for fieldprogrammable systems, and device technology for field-programmable chips. And, last, but not least, a study of how the technology can be leveraged in practice to improve applications, turning the potential technology benefits into actual gains for the end users. With this wide range of topics, and despite the virtual conference mode, FPT 2021 attracted 129 submissions across its four tracks. After a thorough reviewing process, which included a rebuttal phase and at least three reviews for the research papers, 27 contributions could be accepted as full papers (21% acceptance rate), and 14 as short papers (32% overall acceptance rate). After the conference, we invited the best eight papers from the FPT conference to submit extended versions of their work to ACM TRETS. Four author groups accepted this invitation and provided new manuscripts that underwent the full TRETS review-and-revision process. Of the new manuscripts provided, three were revised sufficiently to achieve acceptance in time for inclusion into this special issue:","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 2"},"PeriodicalIF":2.3,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46003244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-05 | DOI: https://dl.acm.org/doi/10.1145/3603504
Aman Arora, Atharva Bhamburkar, Aatman Borda, Tanmay Anand, Rishabh Sehgal, Bagus Hanindhito, Pierre-Emmanuel Gaillardon, Jaydeep Kulkarni, Lizy K. John
Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-In-Memory Blocks for FPGAs) RAMs. These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like Deep Learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Compared to existing proposals, CoMeFa RAMs do not require changes to the underlying SRAM technology, such as simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks also find use in diverse domains such as signal processing and databases. By augmenting an Intel Arria 10-like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55× (1.85×) across microbenchmarks from various applications and a geomean speedup of up to 2.5× across multiple Deep Neural Networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.
{"title":"CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration","authors":"Aman Arora, Atharva Bhamburkar, Aatman Borda, Tanmay Anand, Rishabh Sehgal, Bagus Hanindhito, Pierre-Emmanuel Gaillardon, Jaydeep Kulkarni, Lizy K. John","doi":"https://dl.acm.org/doi/10.1145/3603504","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3603504","url":null,"abstract":"<p>Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (<underline>Co</underline>mpute-In-<underline>Me</underline>mory Blocks for <underline>F</underline>PG<underline>A</underline>s) RAMs. These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like Deep Learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density, while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Compared to existing proposals, CoMeFa RAMs do not require changing the underlying SRAM technology like simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks find applications in diverse applications like signal processing, databases, etc. By augmenting an Intel Arria-10-like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55x (1.85x) across microbenchmarks from various applications and a geomean speedup of up to 2.5x across multiple Deep Neural Networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"192 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-06-02 | DOI: https://dl.acm.org/doi/10.1145/3597614
Lana Josipović, Axel Marmet, Andrea Guerrieri, Paolo Ienne
To achieve resource-efficient hardware designs, high-level synthesis tools share (i.e., time-multiplex) functional units among operations of the same type. This optimization is typically performed in conjunction with operation scheduling to ensure the best possible unit usage at each point in time. Dataflow circuits have emerged as an alternative HLS approach to efficiently handle irregular and control-dominated code. However, these circuits do not have a predetermined schedule—in its absence, it is challenging to determine which operations can share a functional unit without a performance penalty. More critically, although sharing seems to imply only some trivial circuitry, time-multiplexing units in dataflow circuits may cause deadlock by blocking certain data transfers and preventing operations from executing. In this paper, we present a technique to automatically identify performance-acceptable resource sharing opportunities in dataflow circuits. More importantly, we describe a sharing mechanism which achieves functionally correct and deadlock-free dataflow designs. On a set of benchmarks obtained from C code, we show that our approach effectively implements resource sharing. It results in significant area savings at a minor performance penalty compared to dataflow circuits which do not support this feature (i.e., it achieves a 64%, 2%, and 18% average reduction in DSPs, LUTs, and FFs, respectively, with an average increase in total execution time of only 2%) and matches the sharing capabilities of a state-of-the-art HLS tool.
{"title":"Resource Sharing in Dataflow Circuits","authors":"Lana Josipović, Axel Marmet, Andrea Guerrieri, Paolo Ienne","doi":"https://dl.acm.org/doi/10.1145/3597614","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3597614","url":null,"abstract":"<p>To achieve resource-efficient hardware designs, high-level synthesis tools share (i.e., time-multiplex) functional units among operations of the same type. This optimization is typically performed in conjunction with operation scheduling to ensure the best possible unit usage at each point in time. Dataflow circuits have emerged as an alternative HLS approach to efficiently handle irregular and control-dominated code. However, these circuits do not have a predetermined schedule—in its absence, it is challenging to determine which operations can share a functional unit without a performance penalty. More critically, although sharing seems to imply only some trivial circuitry, time-multiplexing units in dataflow circuits may cause deadlock by blocking certain data transfers and preventing operations from executing. In this paper, we present a technique to automatically identify performance-acceptable resource sharing opportunities in dataflow circuits. More importantly, we describe a sharing mechanism which achieves functionally correct and deadlock-free dataflow designs. On a set of benchmarks obtained from C code, we show that our approach effectively implements resource sharing. It results in significant area savings at a minor performance penalty compared to dataflow circuits which do not support this feature (i.e., it achieves a 64%, 2%, and 18% average reduction in DSPs, LUTs, and FFs, respectively, with an average increase in total execution time of only 2%) and matches the sharing capabilities of a state-of-the-art HLS tool.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"6 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-31 | DOI: https://dl.acm.org/doi/10.1145/3599973
Jianyi Cheng, Lana Josipović, John Wickerson, George A. Constantinides
Recently, there has been a trend toward using high-level synthesis (HLS) tools to generate dynamically scheduled hardware. The generated hardware is made up of components connected using handshake signals. These handshake signals schedule the components at run time when inputs become available. Such approaches promise superior performance on ‘irregular’ source programs, such as those whose control flow depends on input data, at the cost of additional area. Current dynamic scheduling techniques are well able to exploit parallelism among instructions within each basic block (BB) of the source program, but parallelism between BBs is under-explored, due to the complexity of run-time control flow and memory dependencies. Existing tools allow some of the operations of different BBs to overlap, but in order to simplify the analysis required at compile time they require the BBs to start in strict program order, thus limiting the achievable parallelism and overall performance.
We formulate a general dependency model suitable for comparing the ability of different dynamic scheduling approaches to extract maximal parallelism at run time. Using this model, we explore a variety of mechanisms for run-time scheduling, incorporating and generalising existing approaches. In particular, we precisely identify the restrictions in existing scheduling implementations and define possible optimisation solutions. We identify two particularly promising examples where the compile-time overhead is small and the area overhead is minimal, yet we are able to significantly speed up execution time: (1) parallelising consecutive independent loops; and (2) parallelising independent inner-loop instances in a nested loop as individual threads. Using benchmark sets from related works, we compare our proposed toolflow against a state-of-the-art dynamic-scheduling HLS tool called Dynamatic. Our results show that on average, our toolflow yields a 4× speedup from (1) and a 2.9× speedup from (2), with a negligible area overhead. This increases to a 14.3× average speedup when combining (1) and (2).
{"title":"Parallelising Control Flow in Dynamic-Scheduling High-Level Synthesis","authors":"Jianyi Cheng, Lana Josipović, John Wickerson, George A. Constantinides","doi":"https://dl.acm.org/doi/10.1145/3599973","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3599973","url":null,"abstract":"<p>Recently, there is a trend to use high-level synthesis (HLS) tools to generate dynamically scheduled hardware. The generated hardware is made up of components connected using handshake signals. These handshake signals schedule the components at run time when inputs become available. Such approaches promise superior performance on ‘irregular’ source programs, such as those whose control flow depends on input data. This is at the cost of additional area. Current dynamic scheduling techniques are well able to exploit parallelism among instructions <i>within</i> each basic block (BB) of the source program, but parallelism <i>between</i> BBs is under-explored, due to the complexity in run-time control flows and memory dependencies. Existing tools allow some of the operations of different BBs to overlap, but in order to simplify the analysis required at compile time they require the BBs to <i>start</i> in strict program order, thus limiting the achievable parallelism and overall performance. </p><p>We formulate a general dependency model suitable for comparing the ability of different dynamic scheduling approaches to extract maximal parallelism at run-time. Using this model, we explore a variety of mechanisms for run-time scheduling, incorporating and generalising existing approaches. In particular, we precisely identify the restrictions in existing scheduling implementation and define possible optimisation solutions. We identify two particularly promising examples where the compile-time overhead is small and the area overhead is minimal and yet we are able to significantly speed-up execution time: (1) parallelising consecutive independent loops; and (2) parallelising independent inner-loop instances in a nested loop as individual threads. Using benchmark sets from related works, we compare our proposed toolflow against a state-of-the-art dynamic-scheduling HLS tool called Dynamatic. Our results show that on average, our toolflow yields a 4 × speedup from (1) and a 2.9 × speedup from (2), with a negligible area overhead. This increases to a 14.3 × average speedup when combining (1) and (2).</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"25 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}