High-Speed Large Language Model Serving on PCs with Consumer-Grade GPUs

PowerInfer is a high-speed Large Language Model (LLM) inference engine for personal computers (PCs) equipped with a single consumer-grade GPU. The key to PowerInfer's design is exploiting the high locality inherent in LLM inference, which is characterized by a power-law distribution in neuron activation: a small subset of neurons, called hot neurons, are consistently activated across inputs, while the majority, cold neurons, activate only for specific inputs. PowerInfer takes advantage of this insight by preloading hot neurons onto the GPU for fast access, while cold neurons are computed on the CPU, significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive activation predictors and neuron-aware sparse operators to optimize efficiency. In the paper's evaluation, PowerInfer achieves a high token generation rate, outperforming llama.cpp by up to 11.69x on an NVIDIA RTX 4090. It is a flexible and easy-to-use inference engine that can be deployed locally on consumer-grade hardware.
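The hot/cold split is easier to see in a toy sketch. Below is a minimal, hypothetical Python/NumPy illustration, not PowerInfer's actual API: names such as `hot_frac` and `ffn_forward` are invented here, and the GPU/CPU partitions are simulated in plain NumPy. Neurons are ranked by offline activation frequency, the most frequently activated fraction is pinned to the "GPU" partition, and at inference only predictor-selected neurons are computed, each on its assigned device.

```python
import numpy as np

# Hypothetical sketch of a hot/cold neuron split for one FFN layer.
# All names and parameters here are illustrative, not PowerInfer's API.

rng = np.random.default_rng(0)
n_neurons, d_model = 4096, 1024

# Offline profiling: count how often each neuron activates over sample inputs.
# Real activation frequencies follow a power law; a Zipf draw simulates that.
activation_counts = rng.zipf(a=2.0, size=n_neurons).astype(float)

# Mark the most frequently activated neurons as "hot" (preloaded on the GPU);
# the rest are "cold" and stay with the CPU.
hot_frac = 0.2
hot_ids = np.argsort(activation_counts)[-int(hot_frac * n_neurons):]
is_hot = np.zeros(n_neurons, dtype=bool)
is_hot[hot_ids] = True

# One FFN weight matrix; in the real system hot and cold rows would be
# stored as separate partitions in GPU and CPU memory.
W = rng.standard_normal((n_neurons, d_model)).astype(np.float32)

def ffn_forward(x, predicted_active):
    """Compute only the neurons the predictor expects to activate:
    hot rows on the GPU partition, cold rows on the CPU partition."""
    out = np.zeros(n_neurons, dtype=np.float32)
    hot_sel = predicted_active & is_hot
    cold_sel = predicted_active & ~is_hot
    # In PowerInfer these two sparse matvecs run on GPU and CPU respectively.
    out[hot_sel] = W[hot_sel] @ x
    out[cold_sel] = W[cold_sel] @ x
    return np.maximum(out, 0.0)  # ReLU keeps the output sparse

x = rng.standard_normal(d_model).astype(np.float32)
predicted = rng.random(n_neurons) < 0.1  # stand-in for the adaptive predictor
y = ffn_forward(x, predicted)
```

Because activations follow a power law, the small hot set covers most of the work, so the bulk of the computation lands on the GPU while the CPU handles the long tail of cold neurons, which is what shrinks GPU memory usage and CPU-GPU transfers.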

https://github.com/SJTU-IPADS/PowerInfer