{"title":"Contemporary Model Compression on Large Language Models Inference","authors":"Dong Liu","doi":"arxiv-2409.01990","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have revolutionized natural language processing\nby achieving state-of-the-art results across a variety of tasks. However, the\ncomputational demands of LLM inference, including high memory consumption and\nslow processing speeds, pose significant challenges for real-world\napplications, particularly on resource-constrained devices. Efficient inference\nis crucial for scaling the deployment of LLMs to a broader range of platforms,\nincluding mobile and edge devices. This survey explores contemporary techniques in model compression that\naddress these challenges by reducing the size and computational requirements of\nLLMs while maintaining their performance. We focus on model-level compression\nmethods, including quantization, knowledge distillation, and pruning, as well\nas system-level optimizations like KV cache efficient design. Each of these\nmethodologies offers a unique approach to optimizing LLMs, from reducing\nnumerical precision to transferring knowledge between models and structurally\nsimplifying neural networks. Additionally, we discuss emerging trends in\nsystem-level design that further enhance the efficiency of LLM inference. This\nsurvey aims to provide a comprehensive overview of current advancements in\nmodel compression and their potential to make LLMs more accessible and\npractical for diverse applications.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"33 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01990","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices.

This survey explores contemporary model compression techniques that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations such as efficient key-value (KV) cache design. Each of these methodologies offers a distinct approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further improve the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.
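
To make the "reducing numerical precision" idea concrete, the sketch below shows symmetric per-tensor int8 weight quantization in NumPy. This is an illustrative assumption rather than code from the paper: the function names (quantize_int8, dequantize_int8) and the toy 4x4 weight block are hypothetical, and production quantizers typically use per-channel or group-wise scales derived from calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization.

    Returns the int8 weights and the scale needed to dequantize them.
    """
    # Map the largest absolute weight onto the int8 range [-127, 127].
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix from int8 values."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a block of LLM weights.
    w = rng.standard_normal((4, 4)).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize_int8(q, s)
    print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
```

In this simplified setting, storing weights as int8 plus one float scale cuts weight memory roughly 4x relative to float32, at the cost of the small reconstruction error printed above; the surveyed methods trade off such errors against model quality in more sophisticated ways.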