ELMS: Elasticized Large Language Models On Mobile Devices
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
{"title":"ELMS: Elasticized Large Language Models On Mobile Devices","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":null,"url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\nenabling applications such as UI automation while addressing privacy concerns.\nCurrently, the standard approach involves deploying a single, robust LLM as a\nuniversal solution for various applications, often referred to as\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\nService-Level Objectives (SLOs) regarding inference latency across different\napplications. To address this issue, we introduce ELMS, an on-device LLM\nservice designed to provide elasticity in both the model and prompt dimensions\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\nwhich utilizes the inherent permutation consistency within transformer models\nto create high-quality, elastic sub-models with minimal runtime switching\ncosts. A dual-head compact language model, which efficiently refines prompts\nand coordinates the elastic adaptation between the model and the prompt. We\nhave implemented this elastic on-device LLM service on several off-the-shelf\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\ndatasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\ncomparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
On-device Large Language Models (LLMs) are revolutionizing mobile AI,
enabling applications such as UI automation while addressing privacy concerns.
Currently, the standard approach involves deploying a single, robust LLM as a
universal solution for various applications, often referred to as
LLM-as-a-Service (LLMaaS). However, this approach faces a significant system
challenge: existing LLMs lack the flexibility to accommodate the diverse
Service-Level Objectives (SLOs) regarding inference latency across different
applications. To address this issue, we introduce ELMS, an on-device LLM
service designed to provide elasticity in both the model and prompt dimensions
of an LLMaaS. ELMS comprises two key techniques: (1) a one-time neuron reordering technique, which exploits the inherent permutation consistency of transformer models to create high-quality elastic sub-models with minimal runtime switching cost; and (2) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% (11.04% on average) in absolute accuracy, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
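
To make the neuron-reordering idea concrete, below is a minimal sketch of importance-based reordering for an elastic transformer FFN, assuming a standard two-matrix MLP (w_in, w_out) and an output-weight L2-norm importance score. All names and the scoring metric are illustrative assumptions, not the paper's actual offline procedure; the sketch only demonstrates why permuting both matrices by the same index preserves the layer's function (permutation consistency) and why sub-model switching then costs nothing at runtime.

```python
# Illustrative sketch: one-time neuron reordering for an elastic FFN.
import numpy as np

def reorder_ffn(w_in: np.ndarray, w_out: np.ndarray):
    """Permute hidden neurons so the most important come first.

    w_in:  (d_model, d_ff) projection into the FFN hidden layer.
    w_out: (d_ff, d_model) projection back to the residual stream.
    Permuting the columns of w_in and the rows of w_out by the same
    index leaves the layer's output unchanged (permutation consistency).
    """
    importance = np.linalg.norm(w_out, axis=1)  # one score per hidden neuron
    order = np.argsort(-importance)             # descending importance
    return w_in[:, order], w_out[order, :]

def elastic_ffn(x, w_in, w_out, ratio: float):
    """Run only the leading `ratio` fraction of the (pre-reordered) neurons."""
    k = max(1, int(w_in.shape[1] * ratio))
    h = np.maximum(x @ w_in[:, :k], 0.0)        # ReLU, for simplicity
    return h @ w_out[:k, :]

# After the one-time reorder, switching sub-models at runtime is just
# picking a different `ratio`; no weights are copied or reloaded.
rng = np.random.default_rng(0)
w_in, w_out = reorder_ffn(rng.normal(size=(64, 256)),
                          rng.normal(size=(256, 64)))
x = rng.normal(size=(1, 64))
y_full = elastic_ffn(x, w_in, w_out, ratio=1.0)   # full model
y_half = elastic_ffn(x, w_in, w_out, ratio=0.5)   # smaller, faster sub-model
```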
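The dual-head compact model can likewise be sketched as a single small encoder with two output heads: one scoring which prompt tokens to keep (prompt refinement) and one choosing an elastic sub-model width (model adaptation). The architecture, head shapes, keep-threshold, and ratio set below are all assumptions for illustration; they are not the paper's implementation.

```python
# Illustrative sketch: a dual-head compact model that jointly refines the
# prompt and selects an elastic sub-model width.
import torch
import torch.nn as nn

class DualHeadRefiner(nn.Module):
    def __init__(self, vocab: int = 32000, dim: int = 256, n_ratios: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.keep_head = nn.Linear(dim, 1)          # per-token keep/drop score
        self.ratio_head = nn.Linear(dim, n_ratios)  # sub-model width choice

    def forward(self, tokens: torch.Tensor):
        h = self.encoder(self.embed(tokens))           # (B, T, dim)
        keep_logits = self.keep_head(h).squeeze(-1)    # (B, T)
        ratio_logits = self.ratio_head(h.mean(dim=1))  # (B, n_ratios)
        return keep_logits, ratio_logits

# Usage with an untrained model, so the outputs are arbitrary: prune
# low-scoring prompt tokens, then pick a width ratio for the elastic LLM.
model = DualHeadRefiner()
tokens = torch.randint(0, 32000, (1, 12))
with torch.no_grad():
    keep_logits, ratio_logits = model(tokens)
refined_prompt = tokens[keep_logits.sigmoid() > 0.5]     # kept token ids
ratios = (0.25, 0.5, 0.75, 1.0)                          # assumed width set
width = ratios[ratio_logits.argmax(dim=-1).item()]       # chosen sub-model
```

Coupling both decisions in one compact model is what lets the two elasticity dimensions be traded off jointly against a latency SLO, rather than tuning the prompt and the model size independently.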