A FPGA Accelerator of Distributed A3C Algorithm with Optimal Resource Deployment

IF 0.8 4区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE IET Computers and Digital Techniques Pub Date : 2024-05-27 DOI:10.1049/2024/7855250

Fen Ge, Guohui Zhang, Ziyu Li, Fang Zhou

{"title":"A FPGA Accelerator of Distributed A3C Algorithm with Optimal Resource Deployment","authors":"Fen Ge, Guohui Zhang, Ziyu Li, Fang Zhou","doi":"10.1049/2024/7855250","DOIUrl":null,"url":null,"abstract":"<p>The asynchronous advantage actor-critic (A3C) algorithm is widely regarded as one of the most effective and powerful algorithms among various deep reinforcement learning algorithms. However, the distributed and asynchronous nature of the A3C algorithm brings increased algorithm complexity and computational requirements, which not only leads to an increased training cost but also amplifies the difficulty of deploying the algorithm on resource-limited field programmable gate array (FPGA) platforms. In addition, the resource wastage problem caused by the distributed training characteristics of A3C algorithms and the resource allocation problem affected by the imbalance between the computational amount of inference and training need to be carefully considered when designing accelerators. In this paper, we introduce a deployment strategy designed for distributed algorithms aimed at enhancing the resource utilization of hardware devices. Subsequently, a FPGA architecture is constructed specifically for accelerating the inference and training processes of the A3C algorithm. The experimental results show that our proposed deployment strategy reduces resource consumption by 62.5% and decreases the number of agents waiting for training by 32.2%, and the proposed A3C accelerator achieves 1.83× and 2.39× improvements in speedup compared to CPU (Intel i9-13900K) and GPU (NVIDIA RTX 4090) with less power consumption respectively. Furthermore, our design shows superior resource efficiency compared to existing works.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"2024 1","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/2024/7855250","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Computers and Digital Techniques","FirstCategoryId":"94","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/2024/7855250","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

The asynchronous advantage actor-critic (A3C) algorithm is widely regarded as one of the most effective and powerful algorithms among various deep reinforcement learning algorithms. However, the distributed and asynchronous nature of the A3C algorithm brings increased algorithm complexity and computational requirements, which not only leads to an increased training cost but also amplifies the difficulty of deploying the algorithm on resource-limited field programmable gate array (FPGA) platforms. In addition, the resource wastage problem caused by the distributed training characteristics of A3C algorithms and the resource allocation problem affected by the imbalance between the computational amount of inference and training need to be carefully considered when designing accelerators. In this paper, we introduce a deployment strategy designed for distributed algorithms aimed at enhancing the resource utilization of hardware devices. Subsequently, a FPGA architecture is constructed specifically for accelerating the inference and training processes of the A3C algorithm. The experimental results show that our proposed deployment strategy reduces resource consumption by 62.5% and decreases the number of agents waiting for training by 32.2%, and the proposed A3C accelerator achieves 1.83× and 2.39× improvements in speedup compared to CPU (Intel i9-13900K) and GPU (NVIDIA RTX 4090) with less power consumption respectively. Furthermore, our design shows superior resource efficiency compared to existing works.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

优化资源调配的分布式 A3C 算法 FPGA 加速器

异步优势行动者批判（A3C）算法被广泛认为是各种深度强化学习算法中最有效、最强大的算法之一。然而，A3C 算法的分布式和异步特性增加了算法的复杂性和计算要求，不仅导致训练成本增加，而且加大了在资源有限的现场可编程门阵列（FPGA）平台上部署该算法的难度。此外，A3C 算法的分布式训练特性所造成的资源浪费问题，以及推理计算量和训练计算量不平衡所影响的资源分配问题，都需要在设计加速器时仔细考虑。本文介绍了一种为分布式算法设计的部署策略，旨在提高硬件设备的资源利用率。随后，我们构建了一种 FPGA 架构，专门用于加速 A3C 算法的推理和训练过程。实验结果表明，我们提出的部署策略降低了 62.5% 的资源消耗，减少了 32.2% 的等待训练的代理数量，与功耗更低的 CPU（英特尔 i9-13900K）和 GPU（英伟达 RTX 4090）相比，我们提出的 A3C 加速器的速度分别提高了 1.83 倍和 2.39 倍。此外，与现有作品相比，我们的设计显示出更高的资源效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IET Computers and Digital Techniques 工程技术-计算机：理论方法

CiteScore

3.50

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： IET Computers & Digital Techniques publishes technical papers describing recent research and development work in all aspects of digital system-on-chip design and test of electronic and embedded systems, including the development of design automation tools (methodologies, algorithms and architectures). Papers based on the problems associated with the scaling down of CMOS technology are particularly welcome. It is aimed at researchers, engineers and educators in the fields of computer and digital systems design and test. The key subject areas of interest are: Design Methods and Tools: CAD/EDA tools, hardware description languages, high-level and architectural synthesis, hardware/software co-design, platform-based design, 3D stacking and circuit design, system on-chip architectures and IP cores, embedded systems, logic synthesis, low-power design and power optimisation. Simulation, Test and Validation: electrical and timing simulation, simulation based verification, hardware/software co-simulation and validation, mixed-domain technology modelling and simulation, post-silicon validation, power analysis and estimation, interconnect modelling and signal integrity analysis, hardware trust and security, design-for-testability, embedded core testing, system-on-chip testing, on-line testing, automatic test generation and delay testing, low-power testing, reliability, fault modelling and fault tolerance. Processor and System Architectures: many-core systems, general-purpose and application specific processors, computational arithmetic for DSP applications, arithmetic and logic units, cache memories, memory management, co-processors and accelerators, systems and networks on chip, embedded cores, platforms, multiprocessors, distributed systems, communication protocols and low-power issues. Configurable Computing: embedded cores, FPGAs, rapid prototyping, adaptive computing, evolvable and statically and dynamically reconfigurable and reprogrammable systems, reconfigurable hardware. Design for variability, power and aging: design methods for variability, power and aging aware design, memories, FPGAs, IP components, 3D stacking, energy harvesting. Case Studies: emerging applications, applications in industrial designs, and design frameworks.