Job assignment in machine learning inference systems with accuracy constraints

Performance Evaluation · IF: 1 · CAS Zone 4 (Computer Science) · JCR Q4 (Computer Science, Hardware & Architecture) · Pub Date: 2024-12-12 · DOI: 10.1016/j.peva.2024.102463

Tuhinangshu Choudhury , Gauri Joshi , Weina Wang

Performance Evaluation, Volume 167, Article 102463. https://www.sciencedirect.com/science/article/pii/S0166531624000683

Abstract

Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model can be more accurate but slow, whereas a smaller, less accurate model can be faster in serving inference queries. Amidst the rapid advancements in Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system where servers have heterogeneous rates and accuracies, and our goal is to minimize the expected inference latency while meeting an average accuracy target. To the best of our knowledge, such queueing systems with constraints have been sparsely studied in prior literature. We first identify a lower bound on the minimum achievable latency under any policy that achieves the target accuracy $a^{*}$ using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized Join-the-Idle-Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as system size increases) achieves the optimal latency $T_{\text{LP-LB}}(\lambda)$. However, the R-JIQ policy relies on knowledge of the arrival rate $\lambda$ to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy, which incorporates the concept of ordered pairs of servers into waterfilling to iteratively solve the LP. This allows the POP policy to function without relying on the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.
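The abstract does not state the LP itself, so the following is only a minimal sketch of what a latency lower bound of this kind could look like, under assumptions that are not taken from the paper: jobs never queue (so the latency of a job equals its mean service time 1/mu_i on server type i), each server type i has a total service capacity cap_i, and a fraction p_i of arrivals is routed to type i. The function name `lp_latency_lower_bound` and all parameter names are hypothetical. The LP minimizes sum(p_i / mu_i) subject to sum(p_i) = 1, sum(acc_i * p_i) >= a_star, and 0 <= p_i <= cap_i / lam. Because the problem has only two coupling constraints, this sketch solves it by enumerating candidate vertices with the standard library rather than calling an LP solver.

```python
from itertools import combinations, product


def lp_latency_lower_bound(mu, acc, cap, lam, a_star, tol=1e-9):
    """Illustrative LP (an assumption, not the paper's exact formulation).

    mu[i]  : service rate of server type i
    acc[i] : accuracy of server type i
    cap[i] : total service capacity of type i (jobs per unit time)
    lam    : arrival rate; a_star: average accuracy target
    Returns (latency_bound, routing_fractions), or None if infeasible.
    """
    n = len(mu)
    upper = [min(1.0, cap[i] / lam) for i in range(n)]
    best = None
    # At a vertex, at most two variables sit strictly between their bounds,
    # because only two non-bound constraints exist (sum-to-one, accuracy).
    for n_free in (0, 1, 2):
        for free in combinations(range(n), n_free):
            fixed = [i for i in range(n) if i not in free]
            for choice in product((0, 1), repeat=len(fixed)):
                p = [0.0] * n
                for i, at_upper in zip(fixed, choice):
                    p[i] = upper[i] if at_upper else 0.0
                rest = 1.0 - sum(p)  # mass left for the free variables
                if n_free == 0:
                    if abs(rest) > tol:
                        continue
                elif n_free == 1:
                    p[free[0]] = rest
                else:
                    j, k = free
                    if abs(acc[j] - acc[k]) < tol:
                        continue
                    # Accuracy constraint is tight: solve the 2x2 system
                    # p_j + p_k = rest, acc_j p_j + acc_k p_k = need.
                    need = a_star - sum(acc[i] * p[i] for i in fixed)
                    p[j] = (need - acc[k] * rest) / (acc[j] - acc[k])
                    p[k] = rest - p[j]
                in_bounds = all(-tol <= p[i] <= upper[i] + tol for i in range(n))
                accurate = sum(a * q for a, q in zip(acc, p)) >= a_star - tol
                if not (in_bounds and accurate):
                    continue
                cost = sum(q / m for q, m in zip(p, mu))
                if best is None or cost < best[0]:
                    best = (cost, p)
    return best


if __name__ == "__main__":
    # Hypothetical numbers: a fast, less accurate type and a slow, accurate one.
    print(lp_latency_lower_bound(
        mu=[10.0, 2.0], acc=[0.8, 0.95], cap=[50.0, 20.0], lam=30.0, a_star=0.86))
```

With the hypothetical numbers in the example, the accuracy target forces 40% of the traffic onto the slow, accurate type, giving a bound of about 0.26 time units. The enumeration is exponential in the number of server types, so it only suits small illustrative instances; a real system would hand the same LP to a proper solver.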

Source journal: Performance Evaluation (Engineering and Technology; Computer Science: Theory and Methods)
CiteScore: 3.10
Self-citation rate: 0.00%
Review time: 24 days

Journal description: Performance Evaluation functions as a leading journal in the area of modeling, measurement, and evaluation of performance aspects of computing and communication systems. As such, it aims to present a balanced and complete view of the entire Performance Evaluation profession. Hence, the journal is interested in papers that focus on one or more of the following dimensions:
- Define new performance evaluation tools, including measurement and monitoring tools as well as modeling and analytic techniques
- Provide new insights into the performance of computing and communication systems
- Introduce new application areas where performance evaluation tools can play an important role, and creative new uses for performance evaluation tools

More specifically, common application areas of interest include the performance of:
- Resource allocation and control methods and algorithms (e.g. routing and flow control in networks, bandwidth allocation, processor scheduling, memory management)
- System architecture, design and implementation
- Cognitive radio
- VANETs
- Social networks and media
- Energy efficient ICT
- Energy harvesting
- Data centers
- Data centric networks
- System reliability
- System tuning and capacity planning
- Wireless and sensor networks
- Autonomic and self-organizing systems
- Embedded systems
- Network science