ELMS: Elasticized Large Language Models On Mobile Devices

Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
{"title":"ELMS: Elasticized Large Language Models On Mobile Devices","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":null,"url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\nenabling applications such as UI automation while addressing privacy concerns.\nCurrently, the standard approach involves deploying a single, robust LLM as a\nuniversal solution for various applications, often referred to as\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\nService-Level Objectives (SLOs) regarding inference latency across different\napplications. To address this issue, we introduce ELMS, an on-device LLM\nservice designed to provide elasticity in both the model and prompt dimensions\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\nwhich utilizes the inherent permutation consistency within transformer models\nto create high-quality, elastic sub-models with minimal runtime switching\ncosts. A dual-head compact language model, which efficiently refines prompts\nand coordinates the elastic adaptation between the model and the prompt. We\nhave implemented this elastic on-device LLM service on several off-the-shelf\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\ndatasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\ncomparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns. Currently, the standard approach is to deploy a single, robust LLM as a universal solution for various applications, often referred to as LLM-as-a-Service (LLMaaS). However, this approach faces a significant system challenge: existing LLMs lack the flexibility to accommodate the diverse Service-Level Objectives (SLOs) on inference latency across different applications. To address this issue, we introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions of an LLMaaS. The system comprises: (1) a one-time neuron reordering technique, which exploits the inherent permutation consistency within transformer models to create high-quality, elastic sub-models with minimal runtime switching cost; and (2) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines in absolute accuracy by up to 16.83% and by 11.04% on average, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
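The abstract compresses the first mechanism into a single sentence; the sketch below unpacks the intuition. It is a minimal, hypothetical PyTorch illustration of how permutation consistency can yield elastic sub-models in a single ungated transformer FFN block. The function names (`reorder_ffn`, `elastic_ffn`), the ReLU activation, and the magnitude-based importance score are our assumptions for illustration, not details taken from the paper.

```python
import torch

def reorder_ffn(w_up: torch.Tensor, w_down: torch.Tensor,
                importance: torch.Tensor):
    """Permute FFN hidden neurons once, offline, by descending importance.

    Applying the same permutation to the rows of W_up and the columns of
    W_down leaves the layer's output unchanged (permutation consistency),
    so the reordering costs nothing at inference time.
    """
    order = torch.argsort(importance, descending=True)
    return w_up[order, :], w_down[:, order]

def elastic_ffn(x: torch.Tensor, w_up: torch.Tensor,
                w_down: torch.Tensor, k: int) -> torch.Tensor:
    """Run a width-k sub-model using only the first k reordered neurons.

    Because the most important neurons are packed at the front, switching
    widths at runtime is a change of slice bounds, not a weight copy.
    """
    h = torch.relu(x @ w_up[:k].T)   # (batch, k)
    return h @ w_down[:, :k].T       # (batch, d_model)

# Example: a toy layer with d_model=8, d_ffn=32, served at half width.
d_model, d_ffn = 8, 32
w_up = torch.randn(d_ffn, d_model)
w_down = torch.randn(d_model, d_ffn)
imp = w_up.abs().sum(dim=1)          # stand-in per-neuron importance score
w_up, w_down = reorder_ffn(w_up, w_down, imp)
y = elastic_ffn(torch.randn(1, d_model), w_up, w_down, k=d_ffn // 2)
```

At serving time, a request's latency SLO would map to a slice bound k per layer; coordinating that choice together with prompt refinement is the role the abstract assigns to the dual-head compact language model.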