Specializing for Efficiency: Customizing AI Inference Processors on FPGAs

Andrew Boutros, E. Nurvitadhi, Vaughn Betz
{"title":"Specializing for Efficiency: Customizing AI Inference Processors on FPGAs","authors":"Andrew Boutros, E. Nurvitadhi, Vaughn Betz","doi":"10.1109/ICM52667.2021.9664938","DOIUrl":null,"url":null,"abstract":"Artificial intelligence (AI) has become an essential component in modern datacenter applications. The high computational complexity of AI algorithms and the stringent latency constraints for datacenter workloads necessitate the use of efficient specialized AI accelerators. However, the rapid changes in state-of-the-art AI algorithms as well as their varying compute and memory demands challenge accelerator deployments in datacenters as a result of the much slower hardware development cycle. To this end, field-programmable gate arrays (FPGAs) offer the necessary adaptability along with the desired custom hardware efficiency. However, FPGA design is non-trivial; it requires deep hardware expertise and suffers from long compile and debug times, making FPGAs difficult to use for software-oriented AI application developers. AI inference soft processor overlays address this by allowing application developers to write their AI algorithms in a high-level programming language, which are then compiled into instructions to be executed on an AI-targeted soft processor implemented on the FPGA. While the generality of such overlays can eliminate the long bitstream compile times and make FPGAs more accessible for application developers, some classes of the target workloads do not fully utilize the overlay resources resulting in sub-optimal efficiency. In this paper, we investigate the trade-off between hardware efficiency and designer productivity by quantifying the gains and costs of specializing overlays for different classes of AI workloads. We show that per-workload specialized variants of the neural processing unit (NPU), a state-of-the-art AI inference overlay, can achieve up to 41% better performance and 44% area savings.","PeriodicalId":212613,"journal":{"name":"2021 International Conference on Microelectronics (ICM)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Microelectronics (ICM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICM52667.2021.9664938","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Artificial intelligence (AI) has become an essential component in modern datacenter applications. The high computational complexity of AI algorithms and the stringent latency constraints for datacenter workloads necessitate the use of efficient specialized AI accelerators. However, the rapid changes in state-of-the-art AI algorithms as well as their varying compute and memory demands challenge accelerator deployments in datacenters as a result of the much slower hardware development cycle. To this end, field-programmable gate arrays (FPGAs) offer the necessary adaptability along with the desired custom hardware efficiency. However, FPGA design is non-trivial; it requires deep hardware expertise and suffers from long compile and debug times, making FPGAs difficult to use for software-oriented AI application developers. AI inference soft processor overlays address this by allowing application developers to write their AI algorithms in a high-level programming language, which are then compiled into instructions to be executed on an AI-targeted soft processor implemented on the FPGA. While the generality of such overlays can eliminate the long bitstream compile times and make FPGAs more accessible for application developers, some classes of the target workloads do not fully utilize the overlay resources, resulting in sub-optimal efficiency. In this paper, we investigate the trade-off between hardware efficiency and designer productivity by quantifying the gains and costs of specializing overlays for different classes of AI workloads. We show that per-workload specialized variants of the neural processing unit (NPU), a state-of-the-art AI inference overlay, can achieve up to 41% better performance and 44% area savings.
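To make the specialization trade-off described above more concrete, the sketch below models a hypothetical parameterizable matrix-vector overlay: a generic configuration sized for wide layers wastes lanes on narrow workloads, while a variant trimmed to the workload keeps the same cycle count at a fraction of the compute area. All parameter names, the area/cycle models, and the numbers are illustrative assumptions for exposition only; they do not describe the actual NPU architecture or reproduce the paper's 41%/44% results, which also stem from datapath and frequency effects not captured here.

```python
# A minimal, hypothetical sketch of the overlay-specialization trade-off
# described in the abstract. All names, models, and numbers are illustrative
# assumptions, not the NPU's actual architecture or results.

from dataclasses import dataclass


@dataclass
class OverlayConfig:
    """Compile-time parameters of a hypothetical matrix-vector overlay."""
    num_tiles: int       # parallel compute tiles
    lanes_per_tile: int  # multiply-accumulate lanes per tile

    def area_units(self) -> int:
        # Toy area model: proportional to the total number of MAC lanes.
        return self.num_tiles * self.lanes_per_tile


@dataclass
class Workload:
    """One matrix-vector product, e.g. a step of an RNN/MLP layer."""
    name: str
    rows: int  # output-vector length
    cols: int  # input-vector length


def cycles(cfg: OverlayConfig, wl: Workload) -> int:
    """Cycles for one product: rows split across tiles, columns across lanes
    (ceil-divided, so partially filled tiles/lanes still cost a full cycle)."""
    rows_per_tile = -(-wl.rows // cfg.num_tiles)
    col_steps = -(-wl.cols // cfg.lanes_per_tile)
    return rows_per_tile * col_steps


def utilization(cfg: OverlayConfig, wl: Workload) -> float:
    """Fraction of MAC slots doing useful work over the whole product."""
    useful = wl.rows * wl.cols
    offered = cfg.num_tiles * cfg.lanes_per_tile * cycles(cfg, wl)
    return useful / offered


if __name__ == "__main__":
    # Generic overlay sized for wide layers; the narrow workload below leaves
    # half of every lane group idle.
    generic = OverlayConfig(num_tiles=8, lanes_per_tile=64)
    # Specialized variant trims the lanes the narrow workload cannot fill:
    # same cycle count, half the compute area in this toy model.
    specialized = OverlayConfig(num_tiles=8, lanes_per_tile=32)
    narrow = Workload("narrow-layer", rows=64, cols=32)

    for label, cfg in [("generic", generic), ("specialized", specialized)]:
        print(f"{label:>12}: area={cfg.area_units():4d}  "
              f"cycles={cycles(cfg, narrow):3d}  "
              f"utilization={utilization(cfg, narrow):.0%}")
```

Under these assumptions, both configurations finish the narrow workload in the same number of cycles, but the specialized variant reaches full lane utilization with half the MAC array, which is the area-saving side of the trade-off the paper quantifies.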