Interference-Aware Scheduling for Inference Serving

Daniel Mendoza, Francisco Romero, Qian Li, Neeraja J. Yadwadkar, Christos Kozyrakis
{"title":"Interference-Aware Scheduling for Inference Serving","authors":"Daniel Mendoza, Francisco Romero, Qian Li, N. Yadwadkar, C. Kozyrakis","doi":"10.1145/3437984.3458837","DOIUrl":null,"url":null,"abstract":"Machine learning inference applications have proliferated through diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems for improving the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can cause latency degradation due to interference and can subsequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale appropriately to heterogeneous inference serving systems where the number of co-location configurations grows exponentially with the number of models and machine types. This paper proposes an interference-aware scheduler for heterogeneous inference serving systems, reducing the latency degradation from co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., varying latency degradation across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation during co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.","PeriodicalId":269840,"journal":{"name":"Proceedings of the 1st Workshop on Machine Learning and Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st Workshop on Machine Learning and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3437984.3458837","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Machine learning inference applications have proliferated through diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems for improving the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can cause latency degradation due to interference and can subsequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale appropriately to heterogeneous inference serving systems where the number of co-location configurations grows exponentially with the number of models and machine types. This paper proposes an interference-aware scheduler for heterogeneous inference serving systems, reducing the latency degradation from co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., varying latency degradation across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation during co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.
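The core mechanism the abstract describes can be illustrated with a short sketch: a predictor estimates the latency slowdown a model would suffer under a candidate co-location, and the scheduler places the model on the machine with the lowest predicted degradation. The sketch below is a minimal, hypothetical Python rendering of that idea. The feature used (co-located memory footprint), the per-machine-type sensitivity constants, and the linear form of the predictor are all illustrative assumptions, not the paper's actual learned model.

```python
# Hypothetical sketch of an interference-aware scheduler in the spirit of
# the paper: a predictor estimates a model's latency degradation under a
# candidate co-location, and the scheduler picks the machine with the
# lowest predicted degradation. All names, features, and the linear
# predictor are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    isolated_latency_ms: float   # latency when running alone
    memory_footprint_mb: float   # proxy for the memory pressure it exerts


@dataclass
class Machine:
    name: str
    machine_type: str            # e.g., "cpu-small", "gpu-v100"
    resident: list = field(default_factory=list)  # co-located models


# Assumed per-machine-type sensitivities: how strongly co-located memory
# pressure inflates latency on that hardware. The paper learns these
# effects; here they are made-up constants for illustration.
SENSITIVITY = {"cpu-small": 0.004, "cpu-large": 0.002, "gpu-v100": 0.001}


def predict_degradation(model: Model, machine: Machine) -> float:
    """Predict the latency slowdown factor (>= 1.0) if `model` runs on
    `machine` alongside its current residents. A stand-in for the paper's
    unified prediction model: a linear function of co-located memory
    pressure, scaled by a per-machine-type sensitivity."""
    pressure = sum(m.memory_footprint_mb
                   for m in machine.resident if m is not model)
    return 1.0 + SENSITIVITY[machine.machine_type] * pressure / 100.0


def schedule(model: Model, machines: list) -> Machine:
    """Place `model` on the machine with the lowest predicted degradation
    (the interference-aware policy)."""
    best = min(machines, key=lambda mc: predict_degradation(model, mc))
    best.resident.append(model)
    return best


if __name__ == "__main__":
    machines = [Machine("m0", "cpu-small"), Machine("m1", "gpu-v100")]
    resnet = Model("resnet50", isolated_latency_ms=25.0,
                   memory_footprint_mb=200.0)
    bert = Model("bert-base", isolated_latency_ms=40.0,
                 memory_footprint_mb=400.0)
    for m in (resnet, bert):
        chosen = schedule(m, machines)
        slowdown = predict_degradation(m, chosen)
        print(f"{m.name} -> {chosen.name} "
              f"(predicted latency ~ {m.isolated_latency_ms * slowdown:.1f} ms)")
```

For contrast, the least-loaded baseline the paper compares against would pick `min(machines, key=lambda mc: len(mc.resident))`; the interference-aware policy differs precisely when the resource pressure of co-located models, rather than their count, drives latency.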