
ACM Transactions on Reconfigurable Technology and Systems: Latest Publications

A Survey of Processing Systems for Phylogenetics and Population Genetics
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-03-16 | DOI: 10.1145/3588033
Reinout Corts, Nikolaos S. Alachiotis
The COVID-19 pandemic brought Bioinformatics into the spotlight, revealing that several existing methods, algorithms, and tools were not well prepared to handle large amounts of genomic data efficiently. This led to prohibitively long execution times and the need to reduce the extent of analyses to obtain results in a reasonable amount of time. In this survey, we review available high-performance computing and hardware-accelerated systems based on FPGA and GPU technology. Optimized and hardware-accelerated systems can conduct more thorough analyses considerably faster than pure software implementations, allowing researchers to reach important conclusions in a timely manner and drive scientific discoveries. We discuss the reasons that currently hinder high-performance solutions from being widely deployed in real-world biological analyses and describe a research direction that can pave the way to enabling this.
Citations: 0
ZyPR: End-to-end Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-02-27 | DOI: 10.1145/3585521
Alex R. Bucknall, Suhaib A. Fahmy
Partial reconfiguration (PR) is a key enabler of the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance-limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This article presents ZyPR: a complete end-to-end framework that provides high-performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications, with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open-source tools that support PR under Linux. The framework provides a high-performance runtime along with low overhead for its abstractions. We introduce improvements over our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq UltraScale+ by 2× and 5.4× compared to Xilinx's FPGA Manager.
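For context, the kind of vendor flow a runtime manager like ZyPR abstracts away looks roughly like the sketch below, which drives the Linux FPGA Manager through its sysfs interface as exposed on Xilinx kernel trees. This is not ZyPR's API; the attribute names and the bitstream file name are assumptions for illustration.

```python
# A minimal sketch (not ZyPR's API) of driving partial reconfiguration via
# the Linux FPGA Manager sysfs interface as exposed on Xilinx kernel trees.
# The attribute names ("flags", "firmware", "state") and the bitstream file
# name are assumptions for illustration.
from pathlib import Path

FPGA_MGR = Path("/sys/class/fpga_manager/fpga0")

def load_partial_bitstream(firmware_name: str) -> None:
    """Program a partial bitstream that was staged under /lib/firmware."""
    (FPGA_MGR / "flags").write_text("1")               # mark bitstream as partial
    (FPGA_MGR / "firmware").write_text(firmware_name)  # blocks until programming ends
    state = (FPGA_MGR / "state").read_text().strip()
    if state != "operating":
        raise RuntimeError(f"reconfiguration failed, state={state}")

load_partial_bitstream("pr_region0_partial.bin")       # hypothetical file name
```

This file-based flow is blocking and offers little control over provisioning, which is the overhead the article's runtime manager targets.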
Citations: 2
AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-02-15 | DOI: 10.1145/3572959
Hyegang Jun, Hanchen Ye, Hyunmin Jeong, Deming Chen
High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from a behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, considerable effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large, complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformations, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS, which is based on the multi-level compiler infrastructure (MLIR). Finally, we identify a problem that limits the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by up to 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for benchmarks in the MachSuite and Rodinia sets.
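As a rough illustration of the classifier-based filtering step, the sketch below trains a Random Forest on previously labeled design points and uses it to discard likely-invalid candidates before the expensive HLS compiler is invoked. The feature layout and the toy validity rule are assumptions, not the paper's setup.

```python
# A minimal sketch (not the authors' code) of filtering HLS design points
# with a Random Forest before invoking the HLS compiler. Features and the
# toy validity rule below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Each design point: [unroll_factor, pipeline_II, array_partition_factor]
X_train = rng.integers(1, 64, size=(500, 3))
# Label: 1 if HLS previously accepted the point, 0 if it was invalid.
y_train = (X_train[:, 0] * X_train[:, 2] < 512).astype(int)  # toy rule

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

candidates = rng.integers(1, 64, size=(10_000, 3))
keep = clf.predict(candidates).astype(bool)
print(f"{keep.sum()} of {len(candidates)} points forwarded to HLS")
```

The point of the filter is that a misprediction only costs one wasted (or skipped) HLS run, while the saved compiler invocations dominate at DSE scale.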
Citations: 4
Introduction to Special Section on FPT’20
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-02-15 | DOI: 10.1145/3579850
O. Sinnen, Qiang Liu, A. Davoodi
Citations: 0
Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-02-10 | DOI: 10.1145/3583075
Erwei Wang, Marie Auffret, G. Stavrou, P. Cheung, G. Constantinides, M. Abdelfattah, James J. Davis
FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this article, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing an appropriate K a priori is challenging, and doing so at even high granularity, e.g., per layer, is a time-consuming and error-prone process that leaves FPGAs’ spatial flexibility underexploited. Furthermore, prior works connect LUT inputs randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54× and 1.31×, respectively, while matching its accuracy. This implementation also reaches 2.71× the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employing logic shrinkage results in a post-synthesis area reduction of 2.67× vs LUTNet, allowing for implementations that were previously impossible on today’s largest FPGAs. We validate the benefits of logic shrinkage in the context of real application deployment by implementing a face mask detection DNN using BNN, LUTNet and logic-shrunk layers. Our results show that logic shrinkage yields area gains versus LUTNet (up to 1.20×) and equally pruned BNNs (up to 1.08×), along with accuracy improvements.
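A toy sketch of the pruning step follows: given per-input importance scores for a K-input LUT, keep only the top-scoring inputs, so K is effectively learned per LUT rather than fixed globally. The scoring and keep ratio are illustrative assumptions; in the paper, importance is learned during training.

```python
# Minimal sketch of the logic-shrinkage idea: score each LUT input and drop
# the least important ones so K is learned per LUT. The random scores and
# fixed keep ratio below are illustrative stand-ins for learned saliency.
import numpy as np

rng = np.random.default_rng(1)

def shrink_lut_inputs(input_scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return a boolean mask keeping the highest-scoring LUT inputs."""
    k = max(1, int(round(keep_ratio * input_scores.size)))
    keep_idx = np.argsort(input_scores)[-k:]
    mask = np.zeros(input_scores.size, dtype=bool)
    mask[keep_idx] = True
    return mask

# One 6-input LUT whose learned input-importance scores are given.
scores = rng.random(6)
mask = shrink_lut_inputs(scores, keep_ratio=0.5)  # K shrinks from 6 to 3 here
print(scores.round(2), "->", mask)
```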
Citations: 0
VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA Cluster
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-01-13 | DOI: 10.1145/3579848
Tomohiro Ueno, K. Sano
FPGA clusters promise to play a critical role in high-performance computing (HPC) systems in the near future due to their flexibility and high power efficiency. Operating large-scale, general-purpose FPGA clusters on which multiple users run diverse applications requires a flexible network topology that can be divided and reconfigured. This paper proposes the Virtual Circuit-Switching Network (VCSN), which provides an arbitrarily reconfigurable network topology and a simple-to-operate network system among FPGA nodes. With virtualization, user logic on FPGAs can communicate as if a circuit-switching network were available. This paper demonstrates that VCSN over 100 Gbps Ethernet achieves highly efficient point-to-point communication among FPGAs due to its unique and efficient communication protocol. We compare VCSN with a direct connection network (DCN) that connects FPGAs directly. We also show a concrete procedure for realizing collective communication on an FPGA cluster with VCSN. We demonstrate that the flexible virtual topology provided by VCSN can accelerate collective communication with simple operations. Furthermore, based on experimental results, we model and estimate the communication performance of DCN and VCSN in a large FPGA cluster. The results show that VCSN has the potential to accelerate gather communication by up to about 1.97× compared to DCN.
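To illustrate the kind of communication modeling the abstract mentions, without reproducing the paper's actual DCN/VCSN models, a generic latency-bandwidth (alpha-beta) estimate of gather time might look like the sketch below; all constants and the cost structure are assumptions.

```python
# A generic alpha-beta (latency + bandwidth) sketch of gather time, offered
# only to illustrate the style of modeling; the article's actual DCN/VCSN
# models and measured constants are not reproduced here.
def gather_time(nodes: int, msg_bytes: int, alpha_s: float,
                beta_s_per_byte: float, hops_per_transfer: int) -> float:
    """Time for a root to collect one message from each of the other nodes."""
    transfers = nodes - 1
    per_transfer = alpha_s * hops_per_transfer + msg_bytes * beta_s_per_byte
    return transfers * per_transfer

# Illustrative numbers: 100 Gbps links (12.5 GB/s), 1 MiB messages, 16 nodes.
beta = 1 / 12.5e9
t = gather_time(16, 1 << 20, alpha_s=1e-6, beta_s_per_byte=beta,
                hops_per_transfer=1)
print(f"estimated gather time: {t * 1e3:.3f} ms")
```

Under such a model, a virtual circuit-switched topology helps mainly by reducing the effective hop count (and per-hop software overhead) of each transfer.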
Citations: 1
Deterministic Approach for Range-enhanced Reconfigurable Packet Classification Engine
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-01-01 | DOI: 10.1145/3586577
M. Dhayalakumar, S. Mahammad
{"title":"Deterministic Approach for Range-enhanced Reconfigurable Packet Classification Engine","authors":"M. Dhayalakumar, S. Mahammad","doi":"10.1145/3586577","DOIUrl":"https://doi.org/10.1145/3586577","url":null,"abstract":"","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"81 1","pages":"29:1-29:26"},"PeriodicalIF":2.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64069356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-12-22 | DOI: 10.1145/3543176
Niklas Schelten, Fritjof Steinert, Justin Knapheide, Anton Schulte, B. Stabernack
The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General-Purpose GPUs achieving higher performance, either overall or per watt. In most cases, these accelerators are coupled via PCIe interfaces to their hosts, which leads to disadvantages in interoperability, scalability, and power consumption. As a viable alternative to PCIe-attached FPGA accelerators, this paper proposes standalone FPGAs as Network-attached Accelerators (NAAs). To enable reliable communication for decoupled FPGAs, we present an RDMA over Converged Ethernet v2 (RoCEv2) communication stack for high-speed, low-latency data transfer, integrated into a hardware framework. For NAAs to be used instead of PCIe-coupled FPGAs, the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb/s throughput with latencies of less than 4 μs while using about 10% of the available resources on a mid-range FPGA. To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine-learning-based image classification. Based on our measurements, network-attached FPGAs are a compelling alternative to the more energy-demanding PCIe-attached FPGA accelerators.
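For orientation, RoCEv2 carries InfiniBand transport packets inside UDP datagrams on destination port 4791, with the 12-byte Base Transport Header (BTH) immediately after the UDP header. The sketch below packs a BTH in software purely to illustrate the wire format; the opcode and field values are illustrative, and this is not the article's hardware implementation.

```python
# Minimal sketch of packing a RoCEv2 Base Transport Header (BTH). RoCEv2
# frames are UDP datagrams with destination port 4791; the 12-byte BTH
# carries opcode, flags, P_Key, destination QP (24 bit), and PSN (24 bit).
# Values below are illustrative, not the paper's implementation.
import struct

ROCEV2_UDP_DST_PORT = 4791

def pack_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """Pack the 12-byte BTH in network byte order."""
    flags = 0  # SE / MigReq / PadCnt / TVer all zero for this sketch
    return struct.pack(
        ">BBHII",
        opcode & 0xFF,
        flags,
        pkey,
        dest_qp & 0x00FFFFFF,  # high byte of this word is reserved
        psn & 0x00FFFFFF,      # high byte holds the AckReq bit, left clear
    )

bth = pack_bth(opcode=0x0A, dest_qp=0x12, psn=0)  # 0x0A: RC RDMA WRITE Only
assert len(bth) == 12
```

Because the transport headers are this compact and fixed-layout, they lend themselves to the kind of fully pipelined FPGA parsing/generation logic the article describes.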
Citations: 1
FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-12-20 | DOI: 10.1145/3570928
Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, J. Cong
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for several reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers, especially transposed and dilated convolutions, and (3) CNNs’ complex dataflow graphs. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, up to 5.98× and 13.42× for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.
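As a toy illustration of the dynamic-tiling idea (not FlexCNN's code), the sketch below computes a direct 3×3 convolution over output tiles, the kind of decomposition that lets each tile fit a systolic array's on-chip buffers; the tile size is an arbitrary assumption.

```python
# A toy sketch of tiled convolution, illustrating the decomposition that
# on-chip accelerator buffers require. Tile size and the direct 3x3 "valid"
# convolution are illustrative; this is not FlexCNN's implementation.
import numpy as np

def conv2d_tiled(x: np.ndarray, w: np.ndarray, tile: int = 8) -> np.ndarray:
    """Direct 'valid' convolution computed one output tile at a time."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for ty in range(0, out.shape[0], tile):        # iterate over output tiles
        for tx in range(0, out.shape[1], tile):
            for oy in range(ty, min(ty + tile, out.shape[0])):
                for ox in range(tx, min(tx + tile, out.shape[1])):
                    out[oy, ox] = np.sum(x[oy:oy + kh, ox:ox + kw] * w)
    return out

x = np.random.rand(32, 32)
w = np.random.rand(3, 3)
# Tiling changes the schedule, not the result.
assert np.allclose(conv2d_tiled(x, w, tile=8), conv2d_tiled(x, w, tile=32))
```

Making the tile size a runtime parameter ("dynamic tiling") is what lets one hardware configuration serve layers of very different dimensions.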
Citations: 4
Introduction to the Special Section on FPL 2020
IF 2.3 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2022-12-14 | DOI: 10.1145/3536336
N. Mentens, Lionel Sousa, P. Trancoso
The International Conference on Field Programmable Logic and Applications (FPL) was the first and remains the largest conference in the important area of field-programmable logic and reconfigurable computing. The 30th edition of FPL was scheduled to take place from August 31 to September 4, 2020, in the Chalmers Conference Center in Gothenburg, Sweden, but was moved to a virtual format due to the COVID-19 pandemic. From 158 submissions, the program committee selected 24 full papers and 28 short papers to be presented at the conference. The FPL Program co-Chairs invited the authors of the best papers to submit extended versions of their FPL-published work to compose a Special Issue of the ACM Transactions on Reconfigurable Technology and Systems. Six extended articles that went through a completely new review process have been accepted for publication in this Special Issue. These articles bring new results of research efforts in reconfigurable computing, in the areas of placement and connection of nodes and hard blocks, near-memory processing and HBM, NoCs, and aging in FPGAs. We acknowledge the support of all reviewers, who were fundamental to the article selection process and gave valuable suggestions to the authors. Thanks also go to the authors who submitted articles and to the ACM TRETS support team. We also thank Professor Deming Chen, Editor-in-Chief of ACM TRETS, for hosting this special issue. The article Exploiting HBM on FPGAs for Data Processing focuses on the potential of High Bandwidth Memory (HBM) for FPGA acceleration of data analytics workloads. The authors investigate different aspects of the computation as well as data partitioning and placement. To evaluate the FPGA+HBM setup, the authors integrate three relevant workloads into an in-memory database system: range selection, hash join, and stochastic gradient descent. The results show large performance benefits (6–18×) of the proposed approach compared to traditional server systems running the same workloads, justifying the use of HBM in FPGA accelerators for these workloads. The article Detailed Placement for Dedicated LUT-level FPGA Interconnect studies the impact of dedicated placement on FPGA architectures with direct connections between the Look-Up Tables (LUTs). The authors propose a novel algorithm that orchestrates different Linear Programs (LPs)
Citations: 0