基于云-边缘协同的多压缩尺度DNN推理加速

IF 2.6 3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE ACM Transactions on Embedded Computing Systems Pub Date : 2023-11-28 DOI:10.1145/3634704

Huamei Qi, Fang Ren, Leilei Wang, Ping Jiang, Shaohua Wan, Xiaoheng Deng

{"title":"基于云-边缘协同的多压缩尺度DNN推理加速","authors":"Huamei Qi, Fang Ren, Leilei Wang, Ping Jiang, Shaohua Wan, Xiaoheng Deng","doi":"10.1145/3634704","DOIUrl":null,"url":null,"abstract":"<p>Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment and the diversity of end devices pose a significant challenge for DNN model partitioning strategies. Meanwhile, limited resources of edge server make it difficult to manage resource allocation efficiently among multiple devices. In addition, most of the existing studies disregard the different service requirements of the DNN inference tasks, such as its high accuracy-sensitive or high latency-sensitive. To address these challenges, we propose a Multi-Compression Scale DNN Inference Acceleration (MCIA) based on cloud-edge-end collaboration. We model this problem as a mixed-integer multi-dimensional optimization problem, jointly optimizing the DNN model version choice, the partitioning choice, and the allocation of computational and bandwidth resources to maximize the tradeoff between inference accuracy and latency depending on the property of the tasks. Initially, we train multiple versions of DNN inference models with different compression scales in the cloud, and deploy them to end devices and edge server. Next, a deep reinforcement learning-based algorithm is developed for joint decision making of adaptive collaborative inference and resource allocation based on the current multi-compression scale models and the task property. Experimental results show that MCIA can adapt to heterogeneous devices and dynamic networks, and has superior performance compared with other methods.</p>","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"50 8","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration\",\"authors\":\"Huamei Qi, Fang Ren, Leilei Wang, Ping Jiang, Shaohua Wan, Xiaoheng Deng\",\"doi\":\"10.1145/3634704\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment and the diversity of end devices pose a significant challenge for DNN model partitioning strategies. Meanwhile, limited resources of edge server make it difficult to manage resource allocation efficiently among multiple devices. In addition, most of the existing studies disregard the different service requirements of the DNN inference tasks, such as its high accuracy-sensitive or high latency-sensitive. To address these challenges, we propose a Multi-Compression Scale DNN Inference Acceleration (MCIA) based on cloud-edge-end collaboration. We model this problem as a mixed-integer multi-dimensional optimization problem, jointly optimizing the DNN model version choice, the partitioning choice, and the allocation of computational and bandwidth resources to maximize the tradeoff between inference accuracy and latency depending on the property of the tasks. Initially, we train multiple versions of DNN inference models with different compression scales in the cloud, and deploy them to end devices and edge server. Next, a deep reinforcement learning-based algorithm is developed for joint decision making of adaptive collaborative inference and resource allocation based on the current multi-compression scale models and the task property. Experimental results show that MCIA can adapt to heterogeneous devices and dynamic networks, and has superior performance compared with other methods.</p>\",\"PeriodicalId\":50914,\"journal\":{\"name\":\"ACM Transactions on Embedded Computing Systems\",\"volume\":\"50 8\",\"pages\":\"\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2023-11-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Embedded Computing Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3634704\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3634704","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

边缘智能已经成为一种有前途的范例，通过模型划分来加速DNN推理，这对于需要高精度和低延迟的智能场景特别有用。然而，边缘环境的动态性和终端设备的多样性对深度神经网络模型划分策略提出了重大挑战。同时，由于边缘服务器的资源有限，难以有效地管理多设备之间的资源分配。此外，现有的研究大多忽略了DNN推理任务的不同服务要求，如其高准确性敏感或高延迟敏感。为了解决这些挑战，我们提出了一种基于云边缘协作的多压缩尺度深度神经网络推理加速(MCIA)。我们将该问题建模为一个混合整数多维优化问题，根据任务的性质，共同优化DNN模型版本选择、分区选择以及计算和带宽资源的分配，以最大限度地在推理精度和延迟之间进行权衡。首先，我们在云中训练不同压缩尺度的DNN推理模型的多个版本，并将其部署到终端设备和边缘服务器上。其次，基于现有的多压缩尺度模型和任务属性，提出了一种基于深度强化学习的自适应协同推理和资源分配联合决策算法。实验结果表明，MCIA能够适应异构设备和动态网络，与其他方法相比具有优越的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment and the diversity of end devices pose a significant challenge for DNN model partitioning strategies. Meanwhile, limited resources of edge server make it difficult to manage resource allocation efficiently among multiple devices. In addition, most of the existing studies disregard the different service requirements of the DNN inference tasks, such as its high accuracy-sensitive or high latency-sensitive. To address these challenges, we propose a Multi-Compression Scale DNN Inference Acceleration (MCIA) based on cloud-edge-end collaboration. We model this problem as a mixed-integer multi-dimensional optimization problem, jointly optimizing the DNN model version choice, the partitioning choice, and the allocation of computational and bandwidth resources to maximize the tradeoff between inference accuracy and latency depending on the property of the tasks. Initially, we train multiple versions of DNN inference models with different compression scales in the cloud, and deploy them to end devices and edge server. Next, a deep reinforcement learning-based algorithm is developed for joint decision making of adaptive collaborative inference and resource allocation based on the current multi-compression scale models and the task property. Experimental results show that MCIA can adapt to heterogeneous devices and dynamic networks, and has superior performance compared with other methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Embedded Computing Systems 工程技术-计算机：软件工程

CiteScore

3.70

自引率

0.00%

发文量

138

审稿时长

6 months

期刊介绍： The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.