{"title":"PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU","authors":"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen","doi":"arxiv-2312.12456","DOIUrl":null,"url":null,"abstract":"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\ninference engine on a personal computer (PC) equipped with a single\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\nthe high locality inherent in LLM inference, characterized by a power-law\ndistribution in neuron activation. This distribution indicates that a small\nsubset of neurons, termed hot neurons, are consistently activated across\ninputs, while the majority, cold neurons, vary based on specific inputs.\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\nadaptive predictors and neuron-aware sparse operators, optimizing the\nefficiency of neuron activation and computational sparsity. Evaluation shows\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\nserver-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x\nwhile retaining model accuracy.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.12456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying the design of PowerInfer is
exploiting the high locality inherent in LLM inference, characterized by a
power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, is consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
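
To make the hot/cold distinction concrete, the following is a minimal Python
sketch (not from the paper) of how such a split could be derived from
per-neuron activation counts collected over a calibration corpus. The
function name, the 20% hot fraction, and the synthetic Zipf profile are
illustrative assumptions, not PowerInfer's actual procedure.

    import numpy as np

    def split_hot_cold(activation_counts, hot_fraction=0.2):
        # activation_counts[i]: how often neuron i fired over a calibration
        # corpus. Under a power-law distribution, a small hot_fraction of
        # neurons accounts for most activations.
        order = np.argsort(activation_counts)[::-1]  # most-activated first
        n_hot = int(len(order) * hot_fraction)
        return order[:n_hot], order[n_hot:]          # hot ids, cold ids

    # Synthetic Zipf-like (power-law) activation profile for illustration.
    rng = np.random.default_rng(0)
    counts = rng.zipf(a=2.0, size=8192).astype(float)
    hot, cold = split_hot_cold(counts)
    print(f"{len(hot)} hot neurons cover "
          f"{counts[hot].sum() / counts.sum():.0%} of activations")
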
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access,
while cold-activated neurons are computed on the CPU, significantly reducing
both GPU memory demands and CPU-GPU data transfers.
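
A minimal Python sketch of this split for a single feed-forward layer is
shown below; NumPy arrays stand in for the GPU- and CPU-resident weight
partitions, and all names (hybrid_ffn_matvec, cold_active_idx) are
hypothetical. The key property is that the large cold partition never moves
to the GPU: only small per-token vectors would cross the PCIe bus.

    import numpy as np

    def hybrid_ffn_matvec(x, W_hot, W_cold, cold_active_idx):
        # W_hot: rows for hot neurons (GPU-resident in the real engine);
        # W_cold: rows for cold neurons (kept in CPU memory);
        # cold_active_idx: cold neurons expected to activate for this input.
        y_hot = W_hot @ x                     # dense matvec, GPU partition
        y_cold = W_cold[cold_active_idx] @ x  # sparse matvec, CPU partition
        return y_hot, y_cold                  # merged per neuron id downstream

    rng = np.random.default_rng(1)
    x = rng.standard_normal(64)
    y_hot, y_cold = hybrid_ffn_matvec(
        x,
        W_hot=rng.standard_normal((128, 64)),    # 128 hot neurons "on GPU"
        W_cold=rng.standard_normal((896, 64)),   # 896 cold neurons "on CPU"
        cold_active_idx=np.array([3, 17, 42]),   # few cold neurons fire
    )
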
PowerInfer further integrates adaptive predictors, which forecast per-input
neuron activation, and neuron-aware sparse operators, which exploit the
resulting computational sparsity.
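
As a rough illustration of how an activation predictor could gate a
neuron-aware sparse operator, consider the sketch below. The low-rank
predictor, the 0.5 sigmoid threshold, and the ReLU feed-forward block are
generic assumptions, not PowerInfer's exact predictor design.

    import numpy as np

    def predict_active(x, P1, P2, threshold=0.5):
        # Small low-rank predictor: score each neuron, keep likely-active
        # ones. P1: (rank, d_model); P2: (n_neurons, rank).
        logits = P2 @ np.maximum(P1 @ x, 0.0)
        probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid per neuron
        return np.nonzero(probs > threshold)[0]

    def sparse_ffn(x, W_up, W_down, active_idx):
        # Neuron-aware sparse operator: compute only predicted-active rows.
        h = np.maximum(W_up[active_idx] @ x, 0.0)  # ReLU on active subset
        return W_down[:, active_idx] @ h           # project back to d_model
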
Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. PowerInfer significantly outperforms llama.cpp, by up to 11.69x,
while retaining model accuracy.