{"title":"Job assignment in machine learning inference systems with accuracy constraints","authors":"Tuhinangshu Choudhury , Gauri Joshi , Weina Wang","doi":"10.1016/j.peva.2024.102463","DOIUrl":null,"url":null,"abstract":"<div><div>Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model can be more accurate but slow, whereas a smaller and less accurate can be faster in serving inference queries. Amidst the rapid advancements in Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system where servers have heterogeneous rates and accuracies, and our goal is to minimize the expected inference latency while meeting an average accuracy target. Such queueing systems with constraints have been sparsely studied in prior literature to the best of our knowledge. We first identify a lower bound on the minimum achievable latency under any policy that achieves the target accuracy <span><math><msup><mrow><mi>a</mi></mrow><mrow><mo>∗</mo></mrow></msup></math></span> using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized-Join-the Idle Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as system size increases) achieves the optimal latency <span><math><mrow><msub><mrow><mi>T</mi></mrow><mrow><mtext>LP-LB</mtext></mrow></msub><mrow><mo>(</mo><mi>λ</mi><mo>)</mo></mrow></mrow></math></span>. However, the R-JIQ policy relies on the knowledge of the arrival rate <span><math><mi>λ</mi></math></span> to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy that incorporates the concept of <em>ordered pairs</em> of servers into waterfilling to iteratively solve the LP. This allows the POP policy to function without relying on the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.</div></div>","PeriodicalId":19964,"journal":{"name":"Performance Evaluation","volume":"167 ","pages":"Article 102463"},"PeriodicalIF":1.0000,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Performance Evaluation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0166531624000683","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model can be more accurate but slow, whereas a smaller and less accurate can be faster in serving inference queries. Amidst the rapid advancements in Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system where servers have heterogeneous rates and accuracies, and our goal is to minimize the expected inference latency while meeting an average accuracy target. Such queueing systems with constraints have been sparsely studied in prior literature to the best of our knowledge. We first identify a lower bound on the minimum achievable latency under any policy that achieves the target accuracy using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized-Join-the Idle Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as system size increases) achieves the optimal latency . However, the R-JIQ policy relies on the knowledge of the arrival rate to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy that incorporates the concept of ordered pairs of servers into waterfilling to iteratively solve the LP. This allows the POP policy to function without relying on the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.
期刊介绍:
Performance Evaluation functions as a leading journal in the area of modeling, measurement, and evaluation of performance aspects of computing and communication systems. As such, it aims to present a balanced and complete view of the entire Performance Evaluation profession. Hence, the journal is interested in papers that focus on one or more of the following dimensions:
-Define new performance evaluation tools, including measurement and monitoring tools as well as modeling and analytic techniques
-Provide new insights into the performance of computing and communication systems
-Introduce new application areas where performance evaluation tools can play an important role and creative new uses for performance evaluation tools.
More specifically, common application areas of interest include the performance of:
-Resource allocation and control methods and algorithms (e.g. routing and flow control in networks, bandwidth allocation, processor scheduling, memory management)
-System architecture, design and implementation
-Cognitive radio
-VANETs
-Social networks and media
-Energy efficient ICT
-Energy harvesting
-Data centers
-Data centric networks
-System reliability
-System tuning and capacity planning
-Wireless and sensor networks
-Autonomic and self-organizing systems
-Embedded systems
-Network science