ELMS: Elasticized Large Language Models On Mobile Devices
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
{"title":"ELMS: Elasticized Large Language Models On Mobile Devices","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":null,"url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\nenabling applications such as UI automation while addressing privacy concerns.\nCurrently, the standard approach involves deploying a single, robust LLM as a\nuniversal solution for various applications, often referred to as\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\nService-Level Objectives (SLOs) regarding inference latency across different\napplications. To address this issue, we introduce ELMS, an on-device LLM\nservice designed to provide elasticity in both the model and prompt dimensions\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\nwhich utilizes the inherent permutation consistency within transformer models\nto create high-quality, elastic sub-models with minimal runtime switching\ncosts. A dual-head compact language model, which efficiently refines prompts\nand coordinates the elastic adaptation between the model and the prompt. We\nhave implemented this elastic on-device LLM service on several off-the-shelf\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\ndatasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\ncomparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
On-device Large Language Models (LLMs) are revolutionizing mobile AI,
enabling applications such as UI automation while addressing privacy concerns.
Currently, the standard approach involves deploying a single, robust LLM as a
universal solution for various applications, often referred to as
LLM-as-a-Service (LLMaaS). However, this approach faces a significant system
challenge: existing LLMs lack the flexibility to accommodate the diverse
Service-Level Objectives (SLOs) regarding inference latency across different
applications. To address this issue, we introduce ELMS, an on-device LLM
service designed to provide elasticity in both the model and prompt dimensions
of an LLMaaS. ELMS comprises two key techniques: (1) a one-time neuron reordering technique, which exploits the inherent permutation consistency of transformer models to create high-quality elastic sub-models with minimal runtime switching cost; and (2) a dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt. We have implemented this elastic on-device LLM service on several commercial off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS surpasses four strong baselines by up to 16.83% (11.04% on average) in absolute accuracy, with less than 1% Time-To-First-Token (TTFT) switching overhead, comparable memory usage, and fewer than 100 offline GPU hours.
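
To make the neuron-reordering idea concrete, below is a minimal sketch of importance-based reordering for an elastic transformer FFN, assuming a standard two-matrix MLP (w_in, w_out) and an output-weight L2-norm importance score. All names and the scoring metric are illustrative assumptions, not the paper's actual offline procedure; the sketch only demonstrates why permuting both matrices by the same index preserves the layer's function (permutation consistency) and why sub-model switching then costs nothing at runtime.

```python
# Illustrative sketch: one-time neuron reordering for an elastic FFN.
import numpy as np

def reorder_ffn(w_in: np.ndarray, w_out: np.ndarray):
    """Permute hidden neurons so the most important come first.

    w_in:  (d_model, d_ff) projection into the FFN hidden layer.
    w_out: (d_ff, d_model) projection back to the residual stream.
    Permuting the columns of w_in and the rows of w_out by the same
    index leaves the layer's output unchanged (permutation consistency).
    """
    importance = np.linalg.norm(w_out, axis=1)  # one score per hidden neuron
    order = np.argsort(-importance)             # descending importance
    return w_in[:, order], w_out[order, :]

def elastic_ffn(x, w_in, w_out, ratio: float):
    """Run only the leading `ratio` fraction of the (pre-reordered) neurons."""
    k = max(1, int(w_in.shape[1] * ratio))
    h = np.maximum(x @ w_in[:, :k], 0.0)        # ReLU, for simplicity
    return h @ w_out[:k, :]

# After the one-time reorder, switching sub-models at runtime is just
# picking a different `ratio`; no weights are copied or reloaded.
rng = np.random.default_rng(0)
w_in, w_out = reorder_ffn(rng.normal(size=(64, 256)),
                          rng.normal(size=(256, 64)))
x = rng.normal(size=(1, 64))
y_full = elastic_ffn(x, w_in, w_out, ratio=1.0)   # full model
y_half = elastic_ffn(x, w_in, w_out, ratio=0.5)   # smaller, faster sub-model
```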
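The dual-head compact model can likewise be sketched as a single small encoder with two output heads: one scoring which prompt tokens to keep (prompt refinement) and one choosing an elastic sub-model width (model adaptation). The architecture, head shapes, keep-threshold, and ratio set below are all assumptions for illustration; they are not the paper's implementation.

```python
# Illustrative sketch: a dual-head compact model that jointly refines the
# prompt and selects an elastic sub-model width.
import torch
import torch.nn as nn

class DualHeadRefiner(nn.Module):
    def __init__(self, vocab: int = 32000, dim: int = 256, n_ratios: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.keep_head = nn.Linear(dim, 1)          # per-token keep/drop score
        self.ratio_head = nn.Linear(dim, n_ratios)  # sub-model width choice

    def forward(self, tokens: torch.Tensor):
        h = self.encoder(self.embed(tokens))           # (B, T, dim)
        keep_logits = self.keep_head(h).squeeze(-1)    # (B, T)
        ratio_logits = self.ratio_head(h.mean(dim=1))  # (B, n_ratios)
        return keep_logits, ratio_logits

# Usage with an untrained model, so the outputs are arbitrary: prune
# low-scoring prompt tokens, then pick a width ratio for the elastic LLM.
model = DualHeadRefiner()
tokens = torch.randint(0, 32000, (1, 12))
with torch.no_grad():
    keep_logits, ratio_logits = model(tokens)
refined_prompt = tokens[keep_logits.sigmoid() > 0.5]     # kept token ids
ratios = (0.25, 0.5, 0.75, 1.0)                          # assumed width set
width = ratios[ratio_logits.argmax(dim=-1).item()]       # chosen sub-model
```

Coupling both decisions in one compact model is what lets the two elasticity dimensions be traded off jointly against a latency SLO, rather than tuning the prompt and the model size independently.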