Job assignment in machine learning inference systems with accuracy constraints

Performance Evaluation · Impact Factor: 1.0 · JCR Q4 (Computer Science, Hardware & Architecture) · CAS Zone 4 (Computer Science) · Pub Date: 2024-12-12 · DOI: 10.1016/j.peva.2024.102463
Tuhinangshu Choudhury, Gauri Joshi, Weina Wang
{"title":"Job assignment in machine learning inference systems with accuracy constraints","authors":"Tuhinangshu Choudhury ,&nbsp;Gauri Joshi ,&nbsp;Weina Wang","doi":"10.1016/j.peva.2024.102463","DOIUrl":null,"url":null,"abstract":"<div><div>Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model can be more accurate but slow, whereas a smaller and less accurate can be faster in serving inference queries. Amidst the rapid advancements in Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system where servers have heterogeneous rates and accuracies, and our goal is to minimize the expected inference latency while meeting an average accuracy target. Such queueing systems with constraints have been sparsely studied in prior literature to the best of our knowledge. We first identify a lower bound on the minimum achievable latency under any policy that achieves the target accuracy <span><math><msup><mrow><mi>a</mi></mrow><mrow><mo>∗</mo></mrow></msup></math></span> using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized-Join-the Idle Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as system size increases) achieves the optimal latency <span><math><mrow><msub><mrow><mi>T</mi></mrow><mrow><mtext>LP-LB</mtext></mrow></msub><mrow><mo>(</mo><mi>λ</mi><mo>)</mo></mrow></mrow></math></span>. However, the R-JIQ policy relies on the knowledge of the arrival rate <span><math><mi>λ</mi></math></span> to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy that incorporates the concept of <em>ordered pairs</em> of servers into waterfilling to iteratively solve the LP. This allows the POP policy to function without relying on the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.</div></div>","PeriodicalId":19964,"journal":{"name":"Performance Evaluation","volume":"167 ","pages":"Article 102463"},"PeriodicalIF":1.0000,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Performance Evaluation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0166531624000683","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Cited by: 0

Abstract

Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model can be more accurate but slow, whereas a smaller, less accurate model can serve inference queries faster. Amidst the rapid advancements in Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system in which servers have heterogeneous service rates and accuracies, and our goal is to minimize the expected inference latency while meeting an average accuracy target. To the best of our knowledge, such constrained queueing systems have been only sparsely studied in the prior literature. We first identify a lower bound on the minimum achievable latency under any policy that meets the target accuracy a*, using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized Join-the-Idle-Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as the system size increases) achieves the optimal latency T_LP-LB(λ). However, the R-JIQ policy relies on knowledge of the arrival rate λ to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy, which incorporates the concept of ordered pairs of servers into waterfilling to solve the LP iteratively. This allows the POP policy to operate without knowledge of the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.
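The abstract does not spell out the LP, so the sketch below is only an illustration of the kind of formulation it refers to: route a fraction p_i of the traffic to each server type, require the traffic-weighted accuracy to reach a*, keep every server type stable, and lower-bound latency by the mean service time (which ignores queueing delay, hence a lower bound). The server counts, rates, accuracies, arrival rate, and the exact constraint set are assumptions made for this example, not the paper's model.

```python
# Hypothetical LP lower-bound sketch (illustrative, not the paper's exact formulation).
# p_i = fraction of jobs routed to server type i.
import numpy as np
from scipy.optimize import linprog

n = np.array([8, 8])            # assumed: 8 large-model and 8 small-model servers
mu = np.array([1.0, 3.0])       # assumed service rates (jobs/sec per server)
alpha = np.array([0.95, 0.80])  # assumed per-model accuracies
lam = 10.0                      # assumed total arrival rate
a_star = 0.90                   # assumed average-accuracy target

c = 1.0 / mu                    # objective: minimize sum_i p_i / mu_i (mean service time)

A_ub = np.vstack([
    -alpha.reshape(1, -1),      # accuracy: sum_i alpha_i * p_i >= a*
    np.diag(lam / (n * mu)),    # stability: lam * p_i <= n_i * mu_i for each type i
])
b_ub = np.concatenate([[-a_star], np.ones(len(mu))])
A_eq = np.ones((1, len(mu)))    # routing fractions sum to 1
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * len(mu), method="highs")
if res.success:
    print("traffic split p:", res.x)             # fraction of jobs per model class
    print("service-time lower bound:", res.fun)  # plays the role of T_LP-LB(lambda)
else:
    print("infeasible: accuracy target unreachable under stability constraints")
```

Presumably a policy in the spirit of R-JIQ would use a split like res.x as its randomization weights when joining idle queues, which is why it needs λ to solve the LP; the POP policy described in the abstract avoids that dependence by solving the LP iteratively via waterfilling over ordered pairs of servers.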
Source journal: Performance Evaluation (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 3.10
Self-citation rate: 0.00%
Articles per year: 20
Review time: 24 days
About the journal: Performance Evaluation functions as a leading journal in the area of modeling, measurement, and evaluation of performance aspects of computing and communication systems. As such, it aims to present a balanced and complete view of the entire Performance Evaluation profession. Hence, the journal is interested in papers that focus on one or more of the following dimensions:
- Define new performance evaluation tools, including measurement and monitoring tools as well as modeling and analytic techniques
- Provide new insights into the performance of computing and communication systems
- Introduce new application areas where performance evaluation tools can play an important role, and creative new uses for performance evaluation tools
More specifically, common application areas of interest include the performance of:
- Resource allocation and control methods and algorithms (e.g. routing and flow control in networks, bandwidth allocation, processor scheduling, memory management)
- System architecture, design and implementation
- Cognitive radio
- VANETs
- Social networks and media
- Energy efficient ICT
- Energy harvesting
- Data centers
- Data centric networks
- System reliability
- System tuning and capacity planning
- Wireless and sensor networks
- Autonomic and self-organizing systems
- Embedded systems
- Network science
Latest articles in this journal:
- Statistical properties of a class of randomized binary search algorithms
- Computational algorithms and arrival theorem for non-conventional product-form solutions
- Editorial Board
- Energy-performance tradeoffs in server farms with batch services and setup times
- Foreword - Special Issue - MASCOTS 2023