Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu
{"title":"ELMS:移动设备上的弹性大型语言模型","authors":"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu","doi":"arxiv-2409.09071","DOIUrl":null,"url":null,"abstract":"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\nenabling applications such as UI automation while addressing privacy concerns.\nCurrently, the standard approach involves deploying a single, robust LLM as a\nuniversal solution for various applications, often referred to as\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\nService-Level Objectives (SLOs) regarding inference latency across different\napplications. To address this issue, we introduce ELMS, an on-device LLM\nservice designed to provide elasticity in both the model and prompt dimensions\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\nwhich utilizes the inherent permutation consistency within transformer models\nto create high-quality, elastic sub-models with minimal runtime switching\ncosts. A dual-head compact language model, which efficiently refines prompts\nand coordinates the elastic adaptation between the model and the prompt. We\nhave implemented this elastic on-device LLM service on several off-the-shelf\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\ndatasets and synthesized end-to-end traces. 
Across a range of SLOs, ELMS\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\ncomparable memory usage, and fewer than 100 offline GPU hours.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ELMS: Elasticized Large Language Models On Mobile Devices\",\"authors\":\"Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu\",\"doi\":\"arxiv-2409.09071\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"On-device Large Language Models (LLMs) are revolutionizing mobile AI,\\nenabling applications such as UI automation while addressing privacy concerns.\\nCurrently, the standard approach involves deploying a single, robust LLM as a\\nuniversal solution for various applications, often referred to as\\nLLM-as-a-Service (LLMaaS). However, this approach faces a significant system\\nchallenge: existing LLMs lack the flexibility to accommodate the diverse\\nService-Level Objectives (SLOs) regarding inference latency across different\\napplications. To address this issue, we introduce ELMS, an on-device LLM\\nservice designed to provide elasticity in both the model and prompt dimensions\\nof an LLMaaS. This system includes: A one-time neuron reordering technique,\\nwhich utilizes the inherent permutation consistency within transformer models\\nto create high-quality, elastic sub-models with minimal runtime switching\\ncosts. A dual-head compact language model, which efficiently refines prompts\\nand coordinates the elastic adaptation between the model and the prompt. 
We\\nhave implemented this elastic on-device LLM service on several off-the-shelf\\n(COTS) smartphones and evaluate ELMS using both standalone NLP/mobile-agent\\ndatasets and synthesized end-to-end traces. Across a range of SLOs, ELMS\\nsurpasses four strong baselines by up to 16.83% and 11.04% in absolute accuracy\\non average, with less than 1% Time-To-First-Token (TTFT) switching overhead,\\ncomparable memory usage, and fewer than 100 offline GPU hours.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"31 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09071\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ELMS: Elasticized Large Language Models On Mobile Devices
On-device Large Language Models (LLMs) are revolutionizing mobile AI,
enabling applications such as UI automation while addressing privacy concerns.
Currently, the standard approach involves deploying a single, robust LLM as a
universal solution for various applications, often referred to as
LLM-as-a-Service (LLMaaS). However, this approach faces a significant system
challenge: existing LLMs lack the flexibility to accommodate the diverse
Service-Level Objectives (SLOs) regarding inference latency across different
applications. To address this issue, we introduce ELMS, an on-device LLM
service designed to provide elasticity in both the model and prompt dimensions
of an LLMaaS. The system comprises two key techniques: a one-time neuron
reordering technique, which exploits the inherent permutation consistency of
transformer models to create high-quality elastic sub-models with minimal
runtime switching cost; and a dual-head compact language model, which
efficiently refines prompts and coordinates elastic adaptation between the
model and the prompt. We
have implemented this elastic on-device LLM service on several commercial
off-the-shelf (COTS) smartphones and evaluated ELMS using both standalone
NLP/mobile-agent datasets and synthesized end-to-end traces. Across a range of SLOs, ELMS
outperforms four strong baselines in absolute accuracy by up to 16.83% and by
11.04% on average, with less than 1% Time-To-First-Token (TTFT) switching overhead,
comparable memory usage, and fewer than 100 offline GPU hours.
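The permutation consistency the abstract relies on can be illustrated concretely. In a transformer feed-forward block, permuting the hidden neurons (the rows of the first weight matrix and bias, together with the matching columns of the second weight matrix) leaves the layer's output unchanged, so neurons can be reordered once by importance and the tail truncated at runtime to form a smaller sub-model. The sketch below is a minimal NumPy illustration of this property, not the ELMS implementation; the importance score (L2 norm of outgoing weights) and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_ff, d_model))  # up-projection
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_model, d_ff))  # down-projection
x = rng.normal(size=d_model)

def ffn(x, W1, b1, W2):
    """A plain transformer FFN block: down_proj(relu(up_proj(x)))."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h

# Reorder hidden neurons by a toy importance score: the L2 norm of each
# neuron's outgoing weights in W2 (an assumption, not the ELMS criterion).
order = np.argsort(-np.linalg.norm(W2, axis=0))
W1p, b1p, W2p = W1[order], b1[order], W2[:, order]

# Permutation consistency: the reordered full-width FFN is exactly equivalent.
assert np.allclose(ffn(x, W1, b1, W2), ffn(x, W1p, b1p, W2p))

# An elastic sub-model keeps only the top-k neurons, trading accuracy
# for lower latency; no reordering is needed at switch time.
k = d_ff // 2
y_sub = ffn(x, W1p[:k], b1p[:k], W2p[:, :k])
```

Because the reordering is done once offline, switching between sub-model widths at runtime reduces to slicing contiguous weight blocks, which is consistent with the abstract's claim of minimal switching cost.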