{"title":"Optimizing Speculative Decoding for Serving Large Language Models Using Goodput","authors":"Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang","doi":"arxiv-2406.14066","DOIUrl":null,"url":null,"abstract":"Reducing the inference latency of large language models (LLMs) is crucial,\nand speculative decoding (SD) stands out as one of the most effective\ntechniques. Rather than letting the LLM generate all tokens directly,\nspeculative decoding employs effective proxies to predict potential outputs,\nwhich are then verified by the LLM without compromising the generation quality.\nYet, deploying SD in real online LLM serving systems (with continuous batching)\ndoes not always yield improvement -- under higher request rates or low\nspeculation accuracy, it paradoxically increases latency. Furthermore, there is\nno best speculation length work for all workloads under different system loads.\nBased on the observations, we develop a dynamic framework SmartSpec. SmartSpec\ndynamically determines the best speculation length for each request (from 0,\ni.e., no speculation, to many tokens) -- hence the associated speculative\nexecution costs -- based on a new metric called goodput, which characterizes\nthe current observed load of the entire system and the speculation accuracy. We\nshow that SmartSpec consistently reduces average request latency by up to 3.2x\ncompared to non-speculative decoding baselines across different sizes of target\nmodels, draft models, request rates, and datasets. Moreover, SmartSpec can be\napplied to different styles of speculative decoding, including traditional,\nmodel-based approaches as well as model-free methods like prompt lookup and\ntree-style decoding.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.14066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Reducing the inference latency of large language models (LLMs) is crucial,
and speculative decoding (SD) stands out as one of the most effective
techniques. Rather than letting the LLM generate all tokens directly,
speculative decoding employs effective proxies to predict potential outputs,
which are then verified by the LLM without compromising the generation quality.
Yet, deploying SD in real online LLM serving systems (with continuous batching)
does not always yield improvement -- under higher request rates or low
speculation accuracy, it paradoxically increases latency. Furthermore, no single
speculation length works best across all workloads and system loads.
Based on these observations, we develop SmartSpec, a dynamic framework. SmartSpec
dynamically determines the best speculation length for each request (from 0,
i.e., no speculation, to many tokens) -- hence the associated speculative
execution costs -- based on a new metric called goodput, which characterizes
the current observed load of the entire system and the speculation accuracy. We
show that SmartSpec consistently reduces average request latency by up to 3.2x
compared to non-speculative decoding baselines across different sizes of target
models, draft models, request rates, and datasets. Moreover, SmartSpec can be
applied to different styles of speculative decoding, including traditional,
model-based approaches as well as model-free methods like prompt lookup and
tree-style decoding.
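
To make the goodput idea concrete, here is a minimal sketch (not the authors' implementation) of choosing a speculation length by maximizing estimated goodput, i.e., expected output tokens per unit of verification time. The acceptance-rate model, the linear step-latency model, and all names (expected_accepted_tokens, estimate_step_time, best_speculation_length) and constants are illustrative assumptions, not details taken from the paper.

```python
# Goodput-based speculation-length selection: a hypothetical sketch.
# Assumes a known per-token acceptance rate `alpha` and a simple linear
# model for the latency of one batched verification step.

def expected_accepted_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per request per step when proposing k draft
    tokens: sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha),
    which includes the bonus token produced by the target model."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimate_step_time(batch_size: int, k: int,
                       t_fixed: float = 5e-3, t_per_token: float = 1e-4) -> float:
    """Hypothetical latency model for one verification step: a fixed
    overhead plus a per-token cost for the (k + 1) tokens each request
    contributes to the batched forward pass."""
    return t_fixed + t_per_token * batch_size * (k + 1)

def goodput(alpha: float, batch_size: int, k: int) -> float:
    """Estimated output tokens per second across the batch for
    speculation length k (k = 0 means no speculation)."""
    tokens = batch_size * expected_accepted_tokens(alpha, k)
    return tokens / estimate_step_time(batch_size, k)

def best_speculation_length(alpha: float, batch_size: int, max_k: int = 8) -> int:
    """Pick the speculation length that maximizes estimated goodput."""
    return max(range(max_k + 1), key=lambda k: goodput(alpha, batch_size, k))

if __name__ == "__main__":
    # With these illustrative constants, light load favors long speculation
    # and heavy load drives the optimum toward 0 (no speculation).
    for bs in (1, 16, 128):
        print(f"batch={bs:4d} -> k={best_speculation_length(alpha=0.7, batch_size=bs)}")
```

Under this toy model, the selected length shrinks as the batch grows, which mirrors the abstract's observation that speculation can hurt latency under high request rates or low speculation accuracy.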