PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
{"title":"PowerInfer:使用消费级 GPU 快速处理大型语言模型","authors":"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen","doi":"arxiv-2312.12456","DOIUrl":null,"url":null,"abstract":"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\ninference engine on a personal computer (PC) equipped with a single\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\nthe high locality inherent in LLM inference, characterized by a power-law\ndistribution in neuron activation. This distribution indicates that a small\nsubset of neurons, termed hot neurons, are consistently activated across\ninputs, while the majority, cold neurons, vary based on specific inputs.\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\nadaptive predictors and neuron-aware sparse operators, optimizing the\nefficiency of neuron activation and computational sparsity. Evaluation shows\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\nserver-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x\nwhile retaining model accuracy.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU\",\"authors\":\"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen\",\"doi\":\"arxiv-2312.12456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\\ninference engine on a personal computer (PC) equipped with a single\\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\\nthe high locality inherent in LLM inference, characterized by a power-law\\ndistribution in neuron activation. This distribution indicates that a small\\nsubset of neurons, termed hot neurons, are consistently activated across\\ninputs, while the majority, cold neurons, vary based on specific inputs.\\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\\nadaptive predictors and neuron-aware sparse operators, optimizing the\\nefficiency of neuron activation and computational sparsity. Evaluation shows\\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\\nserver-grade A100 GPU. 
This significantly outperforms llama.cpp by up to 11.69x\\nwhile retaining model accuracy.\",\"PeriodicalId\":501333,\"journal\":{\"name\":\"arXiv - CS - Operating Systems\",\"volume\":\"58 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2312.12456\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.12456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
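To make the hot/cold idea concrete, below is a minimal illustrative sketch of the split and the predictor-gated sparse computation the abstract describes. The function names (split_hot_cold, hybrid_ffn_row_sparse), the gpu_budget parameter, the toy profiling counts, and the use of NumPy are assumptions for illustration only; PowerInfer's actual engine preloads hot neuron weights into GPU memory, computes cold neurons on the CPU, and relies on trained adaptive predictors and neuron-aware sparse operators rather than the dense rows shown here.

```python
# A minimal sketch (not PowerInfer's actual code) of the hot/cold neuron split
# and predictor-gated sparse computation described in the abstract.
import numpy as np

def split_hot_cold(activation_counts, gpu_budget):
    """Rank neurons by how often offline profiling saw them activate (a
    power-law distribution) and mark as "hot" the most frequent ones that
    fit in the GPU memory budget; the remainder are "cold" and stay on CPU."""
    order = np.argsort(activation_counts)[::-1]   # most frequently active first
    hot = set(order[:gpu_budget].tolist())        # preloaded onto the GPU
    cold = set(order[gpu_budget:].tolist())       # kept in CPU memory
    return hot, cold

def hybrid_ffn_row_sparse(x, W, predicted, hot):
    """One feed-forward layer computed sparsely: only the neurons a predictor
    expects to fire are evaluated, each where its weight row lives (GPU for
    hot rows, CPU for cold rows); all other neurons are skipped entirely."""
    y = np.zeros(W.shape[0])
    for i in predicted:
        # In the real engine a hot row is read from GPU memory and a cold row
        # from CPU memory; here both are plain NumPy rows for illustration.
        y[i] = max(0.0, float(W[i] @ x))          # ReLU keeps the output sparse
    return y

# Toy usage: 8 hidden neurons, room for 3 of them in the GPU budget.
counts = np.array([900, 5, 850, 12, 3, 700, 8, 2])  # hypothetical profiling stats
hot, cold = split_hot_cold(counts, gpu_budget=3)     # hot -> {0, 2, 5}
W = np.random.randn(8, 4)
x = np.random.randn(4)
predicted = [0, 2, 5, 3]                             # assumed predictor output
y = hybrid_ffn_row_sparse(x, W, predicted, hot)
```

Because activation frequencies follow a power law, a small GPU-resident subset covers most activations in practice, which is consistent with the abstract's claim that this split keeps GPU memory demands and CPU-GPU transfers low.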