Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy

IF 3.9 3区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Internet Technology Pub Date : 2023-02-23 DOI:https://dl.acm.org/doi/10.1145/3546192
Jianwei Hao, Piyush Subedi, Lakshmish Ramaswamy, In Kee Kim
{"title":"Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy","authors":"Jianwei Hao, Piyush Subedi, Lakshmish Ramaswamy, In Kee Kim","doi":"https://dl.acm.org/doi/10.1145/3546192","DOIUrl":null,"url":null,"abstract":"<p>The wide adoption of smart devices and Internet-of-Things (IoT) sensors has led to massive growth in data generation at the edge of the Internet over the past decade. Intelligent real-time analysis of such a high volume of data, particularly leveraging highly accurate deep learning (DL) models, often requires the data to be processed as close to the data sources (or at the edge of the Internet) to minimize the network and processing latency. The advent of specialized, low-cost, and power-efficient edge devices has greatly facilitated DL inference tasks at the edge. However, limited research has been done to improve the inference throughput (e.g., number of inferences per second) by exploiting various system techniques. This study investigates system techniques, such as batched inferencing, AI multi-tenancy, and cluster of AI accelerators, which can significantly enhance the overall inference throughput on edge devices with DL models for image classification tasks. In particular, AI multi-tenancy enables collective utilization of edge devices’ system resources (CPU, GPU) and AI accelerators (e.g., Edge Tensor Processing Units; EdgeTPUs). The evaluation results show that batched inferencing results in more than 2.4× throughput improvement on devices equipped with high-performance GPUs like Jetson Xavier NX. Moreover, with multi-tenancy approaches, e.g., concurrent model executions (CME) and dynamic model placements (DMP), the DL inference throughput on edge devices (with GPUs) and EdgeTPU can be further improved by up to 3× and 10×, respectively. Furthermore, we present a detailed analysis of hardware and software factors that change the DL inference throughput on edge devices and EdgeTPUs, thereby shedding light on areas that could be further improved to achieve high-performance DL inference at the edge.</p>","PeriodicalId":50911,"journal":{"name":"ACM Transactions on Internet Technology","volume":"17 1","pages":""},"PeriodicalIF":3.9000,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Internet Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3546192","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The wide adoption of smart devices and Internet-of-Things (IoT) sensors has led to massive growth in data generation at the edge of the Internet over the past decade. Intelligent real-time analysis of such a high volume of data, particularly leveraging highly accurate deep learning (DL) models, often requires the data to be processed as close to the data sources (or at the edge of the Internet) to minimize the network and processing latency. The advent of specialized, low-cost, and power-efficient edge devices has greatly facilitated DL inference tasks at the edge. However, limited research has been done to improve the inference throughput (e.g., number of inferences per second) by exploiting various system techniques. This study investigates system techniques, such as batched inferencing, AI multi-tenancy, and cluster of AI accelerators, which can significantly enhance the overall inference throughput on edge devices with DL models for image classification tasks. In particular, AI multi-tenancy enables collective utilization of edge devices’ system resources (CPU, GPU) and AI accelerators (e.g., Edge Tensor Processing Units; EdgeTPUs). The evaluation results show that batched inferencing results in more than 2.4× throughput improvement on devices equipped with high-performance GPUs like Jetson Xavier NX. Moreover, with multi-tenancy approaches, e.g., concurrent model executions (CME) and dynamic model placements (DMP), the DL inference throughput on edge devices (with GPUs) and EdgeTPU can be further improved by up to 3× and 10×, respectively. Furthermore, we present a detailed analysis of hardware and software factors that change the DL inference throughput on edge devices and EdgeTPUs, thereby shedding light on areas that could be further improved to achieve high-performance DL inference at the edge.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
触及天空:利用AI多租户最大化边缘设备上的深度学习推理吞吐量
在过去十年中,智能设备和物联网(IoT)传感器的广泛采用导致了互联网边缘数据生成的大规模增长。对如此大量的数据进行智能实时分析,特别是利用高度精确的深度学习(DL)模型,通常需要在靠近数据源(或在互联网边缘)的地方处理数据,以最大限度地减少网络和处理延迟。专业、低成本和节能的边缘设备的出现极大地促进了边缘的深度学习推理任务。然而,通过利用各种系统技术来提高推理吞吐量(例如,每秒推理次数)的研究有限。本研究探讨了批处理推理、人工智能多租户和人工智能加速器集群等系统技术,这些技术可以显著提高边缘设备上使用深度学习模型进行图像分类任务的整体推理吞吐量。特别是,AI多租户允许集体利用边缘设备的系统资源(CPU, GPU)和AI加速器(例如,边缘张量处理单元;EdgeTPUs)。评估结果表明,在配备Jetson Xavier NX等高性能gpu的设备上,批处理推理使吞吐量提高了2.4倍以上。此外,通过多租户方法,例如并发模型执行(CME)和动态模型放置(DMP),边缘设备(带有gpu)和EdgeTPU上的DL推理吞吐量可以分别进一步提高3倍和10倍。此外,我们还详细分析了改变边缘设备和edgetpu上深度学习推理吞吐量的硬件和软件因素,从而揭示了可以进一步改进的领域,以在边缘实现高性能深度学习推理。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Transactions on Internet Technology
ACM Transactions on Internet Technology 工程技术-计算机:软件工程
CiteScore
10.30
自引率
1.90%
发文量
137
审稿时长
>12 weeks
期刊介绍: ACM Transactions on Internet Technology (TOIT) brings together many computing disciplines including computer software engineering, computer programming languages, middleware, database management, security, knowledge discovery and data mining, networking and distributed systems, communications, performance and scalability etc. TOIT will cover the results and roles of the individual disciplines and the relationshipsamong them.
期刊最新文献
Interpersonal Communication Interconnection in Media Convergence Metaverse Using Reinforcement Learning and Error Models for Drone Precision Landing Towards Human-AI Teaming to Mitigate Alert Fatigue in Security Operations Centres RESP: A Recursive Clustering Approach for Edge Server Placement in Mobile Edge Computing OTI-IoT: A Blockchain-based Operational Threat Intelligence Framework for Multi-vector DDoS Attacks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1