PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
{"title":"PowerInfer:使用消费级 GPU 快速处理大型语言模型","authors":"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen","doi":"arxiv-2312.12456","DOIUrl":null,"url":null,"abstract":"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\ninference engine on a personal computer (PC) equipped with a single\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\nthe high locality inherent in LLM inference, characterized by a power-law\ndistribution in neuron activation. This distribution indicates that a small\nsubset of neurons, termed hot neurons, are consistently activated across\ninputs, while the majority, cold neurons, vary based on specific inputs.\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\nadaptive predictors and neuron-aware sparse operators, optimizing the\nefficiency of neuron activation and computational sparsity. Evaluation shows\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\nserver-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x\nwhile retaining model accuracy.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU\",\"authors\":\"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen\",\"doi\":\"arxiv-2312.12456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\\ninference engine on a personal computer (PC) equipped with a single\\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\\nthe high locality inherent in LLM inference, characterized by a power-law\\ndistribution in neuron activation. This distribution indicates that a small\\nsubset of neurons, termed hot neurons, are consistently activated across\\ninputs, while the majority, cold neurons, vary based on specific inputs.\\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\\nadaptive predictors and neuron-aware sparse operators, optimizing the\\nefficiency of neuron activation and computational sparsity. Evaluation shows\\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\\nserver-grade A100 GPU. 
This significantly outperforms llama.cpp by up to 11.69x\\nwhile retaining model accuracy.\",\"PeriodicalId\":501333,\"journal\":{\"name\":\"arXiv - CS - Operating Systems\",\"volume\":\"58 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2312.12456\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.12456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
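To make the hot/cold idea concrete, below is a minimal illustrative sketch of the split and the predictor-gated sparse computation the abstract describes. The function names (split_hot_cold, hybrid_ffn_row_sparse), the gpu_budget parameter, the toy profiling counts, and the use of NumPy are assumptions for illustration only; PowerInfer's actual engine preloads hot neuron weights into GPU memory, computes cold neurons on the CPU, and relies on trained adaptive predictors and neuron-aware sparse operators rather than the dense rows shown here.

```python
# A minimal sketch (not PowerInfer's actual code) of the hot/cold neuron split
# and predictor-gated sparse computation described in the abstract.
import numpy as np

def split_hot_cold(activation_counts, gpu_budget):
    """Rank neurons by how often offline profiling saw them activate (a
    power-law distribution) and mark as "hot" the most frequent ones that
    fit in the GPU memory budget; the remainder are "cold" and stay on CPU."""
    order = np.argsort(activation_counts)[::-1]   # most frequently active first
    hot = set(order[:gpu_budget].tolist())        # preloaded onto the GPU
    cold = set(order[gpu_budget:].tolist())       # kept in CPU memory
    return hot, cold

def hybrid_ffn_row_sparse(x, W, predicted, hot):
    """One feed-forward layer computed sparsely: only the neurons a predictor
    expects to fire are evaluated, each where its weight row lives (GPU for
    hot rows, CPU for cold rows); all other neurons are skipped entirely."""
    y = np.zeros(W.shape[0])
    for i in predicted:
        # In the real engine a hot row is read from GPU memory and a cold row
        # from CPU memory; here both are plain NumPy rows for illustration.
        y[i] = max(0.0, float(W[i] @ x))          # ReLU keeps the output sparse
    return y

# Toy usage: 8 hidden neurons, room for 3 of them in the GPU budget.
counts = np.array([900, 5, 850, 12, 3, 700, 8, 2])  # hypothetical profiling stats
hot, cold = split_hot_cold(counts, gpu_budget=3)     # hot -> {0, 2, 5}
W = np.random.randn(8, 4)
x = np.random.randn(4)
predicted = [0, 2, 5, 3]                             # assumed predictor output
y = hybrid_ffn_row_sparse(x, W, predicted, hot)
```

Because activation frequencies follow a power law, a small GPU-resident subset covers most activations in practice, which is consistent with the abstract's claim that this split keeps GPU memory demands and CPU-GPU transfers low.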