{"title":"Optimizing Speculative Decoding for Serving Large Language Models Using Goodput","authors":"Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang","doi":"arxiv-2406.14066","DOIUrl":null,"url":null,"abstract":"Reducing the inference latency of large language models (LLMs) is crucial,\nand speculative decoding (SD) stands out as one of the most effective\ntechniques. Rather than letting the LLM generate all tokens directly,\nspeculative decoding employs effective proxies to predict potential outputs,\nwhich are then verified by the LLM without compromising the generation quality.\nYet, deploying SD in real online LLM serving systems (with continuous batching)\ndoes not always yield improvement -- under higher request rates or low\nspeculation accuracy, it paradoxically increases latency. Furthermore, there is\nno best speculation length work for all workloads under different system loads.\nBased on the observations, we develop a dynamic framework SmartSpec. SmartSpec\ndynamically determines the best speculation length for each request (from 0,\ni.e., no speculation, to many tokens) -- hence the associated speculative\nexecution costs -- based on a new metric called goodput, which characterizes\nthe current observed load of the entire system and the speculation accuracy. We\nshow that SmartSpec consistently reduces average request latency by up to 3.2x\ncompared to non-speculative decoding baselines across different sizes of target\nmodels, draft models, request rates, and datasets. Moreover, SmartSpec can be\napplied to different styles of speculative decoding, including traditional,\nmodel-based approaches as well as model-free methods like prompt lookup and\ntree-style decoding.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.14066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Reducing the inference latency of large language models (LLMs) is crucial,
and speculative decoding (SD) stands out as one of the most effective
techniques. Rather than letting the LLM generate all tokens directly,
speculative decoding employs effective proxies to predict potential outputs,
which are then verified by the LLM without compromising the generation quality.
Yet, deploying SD in real online LLM serving systems (with continuous batching)
does not always yield improvement -- under higher request rates or low
speculation accuracy, it paradoxically increases latency. Furthermore, no single
speculation length works best across all workloads and system loads.
Based on these observations, we develop SmartSpec, a dynamic framework. SmartSpec
dynamically determines the best speculation length for each request (from 0,
i.e., no speculation, to many tokens) -- hence the associated speculative
execution costs -- based on a new metric called goodput, which characterizes
the current observed load of the entire system and the speculation accuracy. We
show that SmartSpec consistently reduces average request latency by up to 3.2x
compared to non-speculative decoding baselines across different sizes of target
models, draft models, request rates, and datasets. Moreover, SmartSpec can be
applied to different styles of speculative decoding, including traditional,
model-based approaches as well as model-free methods like prompt lookup and
tree-style decoding.
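
To make the goodput idea concrete, here is a minimal sketch (not the authors' implementation) of choosing a speculation length by maximizing estimated goodput, i.e., expected output tokens per unit of verification time. The acceptance-rate model, the linear step-latency model, and all names (expected_accepted_tokens, estimate_step_time, best_speculation_length) and constants are illustrative assumptions, not details taken from the paper.

```python
# Goodput-based speculation-length selection: a hypothetical sketch.
# Assumes a known per-token acceptance rate `alpha` and a simple linear
# model for the latency of one batched verification step.

def expected_accepted_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per request per step when proposing k draft
    tokens: sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha),
    which includes the bonus token produced by the target model."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimate_step_time(batch_size: int, k: int,
                       t_fixed: float = 5e-3, t_per_token: float = 1e-4) -> float:
    """Hypothetical latency model for one verification step: a fixed
    overhead plus a per-token cost for the (k + 1) tokens each request
    contributes to the batched forward pass."""
    return t_fixed + t_per_token * batch_size * (k + 1)

def goodput(alpha: float, batch_size: int, k: int) -> float:
    """Estimated output tokens per second across the batch for
    speculation length k (k = 0 means no speculation)."""
    tokens = batch_size * expected_accepted_tokens(alpha, k)
    return tokens / estimate_step_time(batch_size, k)

def best_speculation_length(alpha: float, batch_size: int, max_k: int = 8) -> int:
    """Pick the speculation length that maximizes estimated goodput."""
    return max(range(max_k + 1), key=lambda k: goodput(alpha, batch_size, k))

if __name__ == "__main__":
    # With these illustrative constants, light load favors long speculation
    # and heavy load drives the optimum toward 0 (no speculation).
    for bs in (1, 16, 128):
        print(f"batch={bs:4d} -> k={best_speculation_length(alpha=0.7, batch_size=bs)}")
```

Under this toy model, the selected length shrinks as the batch grows, which mirrors the abstract's observation that speculation can hurt latency under high request rates or low speculation accuracy.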