Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
arXiv - CS - Performance, published 2024-06-20. DOI: arxiv-2406.14066 (https://doi.org/arxiv-2406.14066)

Abstract

Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement: under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, no single speculation length works best for all workloads under different system loads. Based on these observations, we develop a dynamic framework, SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens), and hence the associated speculative execution costs, based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
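The abstract does not give a formula for goodput, so the following is only a minimal sketch of the idea of picking a speculation length from load and accuracy, not the paper's implementation. It assumes goodput means expected generated tokens per unit of step time, that each proposed token is accepted independently with probability `alpha`, and that step time follows a made-up linear cost model. All function names, parameters, and constants here are hypothetical.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per request in one verification step.

    Assumption: each of the k proposed tokens is accepted independently with
    probability alpha, and the target model always contributes one token.
    """
    if k == 0:
        return 1.0
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_time(batch_size: int, k: int, draft_cost: float = 1.0,
              base_cost: float = 20.0, per_token_cost: float = 0.5) -> float:
    """Hypothetical cost model (arbitrary time units, not measured values).

    k sequential draft passes plus one target pass whose cost grows with the
    number of tokens verified across the whole batch.
    """
    verify_tokens = batch_size * (k + 1)
    return k * draft_cost + base_cost + per_token_cost * verify_tokens


def goodput(alpha: float, batch_size: int, k: int) -> float:
    """Assumed definition: expected tokens generated per unit of step time."""
    return batch_size * expected_tokens(alpha, k) / step_time(batch_size, k)


def best_speculation_length(alpha: float, batch_size: int, max_k: int = 8) -> int:
    """Pick the k in [0, max_k] with the highest estimated goodput."""
    return max(range(max_k + 1), key=lambda k: goodput(alpha, batch_size, k))


if __name__ == "__main__":
    for batch_size in (1, 16, 256):
        k = best_speculation_length(alpha=0.9, batch_size=batch_size)
        print(f"batch_size={batch_size}: chosen speculation length = {k}")
```

Under these illustrative constants, the chosen length shrinks as the batch grows and falls to 0 when `alpha` is 0, which matches the abstract's qualitative observation that speculation helps less under high load or low accuracy.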