Jianwei Hao, Piyush Subedi, Lakshmish Ramaswamy, In Kee Kim
Title: Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy
Journal: ACM Transactions on Internet Technology, Volume 17, Issue 1
DOI: https://dl.acm.org/doi/10.1145/3546192
Publication date: 2023-02-23 (Journal Article)
JCR: Q2, Computer Science, Information Systems; Impact Factor: 3.9
Citations: 0
Abstract
The wide adoption of smart devices and Internet-of-Things (IoT) sensors has led to massive growth in data generation at the edge of the Internet over the past decade. Intelligent real-time analysis of such a high volume of data, particularly with highly accurate deep learning (DL) models, often requires the data to be processed as close as possible to the data sources (i.e., at the edge of the Internet) to minimize network and processing latency. The advent of specialized, low-cost, and power-efficient edge devices has greatly facilitated DL inference tasks at the edge. However, limited research has been done on improving inference throughput (e.g., the number of inferences per second) by exploiting various system techniques. This study investigates system techniques, such as batched inferencing, AI multi-tenancy, and clusters of AI accelerators, that can significantly enhance overall inference throughput on edge devices running DL models for image classification tasks. In particular, AI multi-tenancy enables collective utilization of edge devices' system resources (CPU, GPU) and AI accelerators (e.g., Edge Tensor Processing Units; EdgeTPUs). The evaluation results show that batched inferencing yields more than a 2.4× throughput improvement on devices equipped with high-performance GPUs such as the Jetson Xavier NX. Moreover, with multi-tenancy approaches, i.e., concurrent model executions (CME) and dynamic model placements (DMP), DL inference throughput on edge devices (with GPUs) and EdgeTPUs can be further improved by up to 3× and 10×, respectively. Furthermore, we present a detailed analysis of the hardware and software factors that affect DL inference throughput on edge devices and EdgeTPUs, shedding light on areas that could be further improved to achieve high-performance DL inference at the edge.
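The intuition behind batched inferencing can be sketched with a toy cost model (not from the paper): each inference invocation carries a fixed overhead (kernel launch, host-to-device transfer) plus a per-sample compute cost, so grouping samples into one forward pass amortizes the fixed part and raises inferences per second. The constants below are hypothetical, chosen only for illustration.

```python
# Toy sketch of why batching improves throughput. The cost constants
# are assumptions for illustration, not measurements from the paper.

FIXED_OVERHEAD_MS = 5.0  # hypothetical per-invocation overhead (launch, transfer)
PER_SAMPLE_MS = 1.0      # hypothetical per-sample compute cost

def inference_time_ms(batch_size: int) -> float:
    """Simulated latency of one batched forward pass."""
    return FIXED_OVERHEAD_MS + PER_SAMPLE_MS * batch_size

def throughput(batch_size: int) -> float:
    """Inferences per second achieved at a given batch size."""
    return batch_size / inference_time_ms(batch_size) * 1000.0

if __name__ == "__main__":
    for bs in (1, 4, 16, 64):
        print(f"batch={bs:3d}  throughput={throughput(bs):7.1f} inf/s")
```

Under this model, throughput grows monotonically with batch size and approaches the 1000/PER_SAMPLE_MS asymptote; real devices additionally hit memory and compute saturation points that cap the useful batch size.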
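The concurrent-model-execution idea can likewise be sketched in miniature: several model instances serve requests side by side so that otherwise idle resources (e.g., the CPU while the GPU or EdgeTPU is busy) stay utilized. The two "models" below are trivial stand-ins, not the paper's image classifiers, and the thread-pool dispatch is only one possible concurrency mechanism.

```python
# Minimal sketch of concurrent model executions (CME): two tenant
# "models" (placeholders, assumed for illustration) serve request
# streams in parallel via a shared thread pool.
from concurrent.futures import ThreadPoolExecutor

def model_a(x: int) -> int:
    return x * 2       # placeholder for, e.g., a GPU-resident model

def model_b(x: int) -> int:
    return x + 100     # placeholder for, e.g., an EdgeTPU-resident model

def run_concurrently(inputs):
    # Dispatch requests for both tenants at once instead of serially;
    # results are gathered per tenant, preserving input order.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures_a = [pool.submit(model_a, x) for x in inputs]
        futures_b = [pool.submit(model_b, x) for x in inputs]
        return ([f.result() for f in futures_a],
                [f.result() for f in futures_b])

if __name__ == "__main__":
    out_a, out_b = run_concurrently([1, 2, 3])
    print(out_a, out_b)  # → [2, 4, 6] [101, 102, 103]
```

In a real deployment the gains come from overlapping heterogeneous resources (CPU, GPU, EdgeTPU); with pure-Python workloads like this toy, the GIL prevents any actual speedup.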
Journal description:
ACM Transactions on Internet Technology (TOIT) brings together many computing disciplines, including computer software engineering, computer programming languages, middleware, database management, security, knowledge discovery and data mining, networking and distributed systems, communications, and performance and scalability. TOIT covers the results and roles of the individual disciplines and the relationships among them.