OptimML: Joint Control of Inference Latency and Server Power Consumption for ML Performance Optimization

IF 2.2 4区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ACM Transactions on Autonomous and Adaptive Systems Pub Date : 2024-05-07 DOI:10.1145/3661825
Guoyu Chen, Xiaorui Wang
{"title":"OptimML: Joint Control of Inference Latency and Server Power Consumption for ML Performance Optimization","authors":"Guoyu Chen, Xiaorui Wang","doi":"10.1145/3661825","DOIUrl":null,"url":null,"abstract":"<p>Power capping is an important technique for high-density servers to safely oversubscribe the power infrastructure in a data center. However, power capping is commonly accomplished by dynamically lowering the server processors’ frequency levels, which can result in degraded application performance. For servers that run important machine learning (ML) applications with Service-Level Objective (SLO) requirements, inference performance such as recognition accuracy must be optimized within a certain latency constraint, which demands high server performance. In order to achieve the best inference accuracy under the desired latency and server power constraints, this paper proposes OptimML, a multi-input-multi-output (MIMO) control framework that jointly controls both inference latency and server power consumption, by flexibly adjusting the machine learning model size (and so its required computing resources) when server frequency needs to be lowered for power capping. Our results on a hardware testbed with widely adopted ML framework (including PyTorch, TensorFlow, and MXNet) show that OptimML achieves higher inference accuracy compared with several well-designed baselines, while respecting both latency and power constraints. Furthermore, an adaptive control scheme with online model switching and estimation is designed to achieve analytic assurance of control accuracy and system stability, even in the face of significant workload/hardware variations.</p>","PeriodicalId":50919,"journal":{"name":"ACM Transactions on Autonomous and Adaptive Systems","volume":"63 1","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Autonomous and Adaptive Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3661825","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Power capping is an important technique for high-density servers to safely oversubscribe the power infrastructure in a data center. However, power capping is commonly accomplished by dynamically lowering the server processors’ frequency levels, which can result in degraded application performance. For servers that run important machine learning (ML) applications with Service-Level Objective (SLO) requirements, inference performance such as recognition accuracy must be optimized within a certain latency constraint, which demands high server performance. In order to achieve the best inference accuracy under the desired latency and server power constraints, this paper proposes OptimML, a multi-input-multi-output (MIMO) control framework that jointly controls both inference latency and server power consumption, by flexibly adjusting the machine learning model size (and so its required computing resources) when server frequency needs to be lowered for power capping. Our results on a hardware testbed with widely adopted ML framework (including PyTorch, TensorFlow, and MXNet) show that OptimML achieves higher inference accuracy compared with several well-designed baselines, while respecting both latency and power constraints. Furthermore, an adaptive control scheme with online model switching and estimation is designed to achieve analytic assurance of control accuracy and system stability, even in the face of significant workload/hardware variations.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
OptimML:联合控制推理延迟和服务器功耗,优化 ML 性能
功率封顶是高密度服务器的一项重要技术,可安全地超额分配数据中心的电力基础设施。然而,功率封顶通常是通过动态降低服务器处理器的频率水平来实现的,这会导致应用性能下降。对于运行有服务级目标(SLO)要求的重要机器学习(ML)应用的服务器来说,必须在一定的延迟约束内优化推理性能(如识别准确率),这就要求服务器具有很高的性能。为了在所需的延迟和服务器功耗限制条件下实现最佳推理精度,本文提出了多输入多输出(MIMO)控制框架 OptimML,当服务器频率需要降低以达到功耗上限时,通过灵活调整机器学习模型的大小(因此也调整了所需的计算资源)来共同控制推理延迟和服务器功耗。我们在采用广泛应用的 ML 框架(包括 PyTorch、TensorFlow 和 MXNet)的硬件测试平台上取得的结果表明,与几种精心设计的基线相比,OptimML 实现了更高的推理精度,同时遵守了延迟和功耗限制。此外,我们还设计了一种具有在线模型切换和估计功能的自适应控制方案,以实现对控制精度和系统稳定性的分析保证,即使面对显著的工作负载/硬件变化也是如此。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Transactions on Autonomous and Adaptive Systems
ACM Transactions on Autonomous and Adaptive Systems 工程技术-计算机:理论方法
CiteScore
4.80
自引率
7.40%
发文量
9
审稿时长
>12 weeks
期刊介绍: TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community -- and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community - and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. Contributions are expected to be based on sound and innovative theoretical models, algorithms, engineering and programming techniques, infrastructures and systems, or technological and application experiences.
期刊最新文献
IBAQ: Frequency-Domain Backdoor Attack Threatening Autonomous Driving via Quadratic Phase Adaptive Scheduling of High-Availability Drone Swarms for Congestion Alleviation in Connected Automated Vehicles Self-Supervised Machine Learning Framework for Online Container Security Attack Detection A Framework for Simultaneous Task Allocation and Planning under Uncertainty Adaptation in Edge Computing: A review on design principles and research challenges
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1