PowerInfer
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
About
PowerInfer, from Shanghai Jiao Tong University, is a hybrid CPU/GPU inference engine that exploits activation locality in large language models. It keeps frequently activated ("hot") neurons on the GPU and rarely activated ("cold") neurons on the CPU. This design speeds up inference on a single consumer GPU. Its authors report up to an 11x speedup over llama.cpp for large ReLU-based models when VRAM is limited. It is released under the MIT license.
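The hot/cold split can be illustrated with a toy sketch. This is not PowerInfer's code or API. `place_neurons`, `hybrid_layer`, and the activation counts are hypothetical. A simple greedy fill stands in for whatever placement policy PowerInfer actually uses, and the `predicted_active` list stands in for its activation predictor. The sketch also assumes that every neuron occupies the same number of bytes and that activation frequencies come from an offline profiling pass.

```python
# Hypothetical sketch of hot/cold neuron placement; NOT PowerInfer's API.
# Assumptions: equal bytes per neuron, profiled activation counts, and a
# greedy fill standing in for PowerInfer's real placement policy.
from typing import Dict, List, Tuple


def place_neurons(
    activation_counts: List[int],
    bytes_per_neuron: int,
    gpu_budget_bytes: int,
) -> Tuple[List[int], List[int]]:
    """Split neuron ids into (gpu_ids, cpu_ids); hottest neurons fill VRAM first."""
    # Rank neurons from most to least frequently activated; ties broken by id.
    order = sorted(range(len(activation_counts)),
                   key=lambda i: (-activation_counts[i], i))
    capacity = gpu_budget_bytes // bytes_per_neuron  # neurons that fit in VRAM
    return sorted(order[:capacity]), sorted(order[capacity:])


def relu(v: float) -> float:
    return v if v > 0.0 else 0.0


def sparse_neurons(
    x: List[float], weights: List[List[float]], ids: List[int]
) -> Dict[int, float]:
    """Compute ReLU(w_i . x) only for the given ids; skipped neurons count as 0."""
    return {i: relu(sum(w * v for w, v in zip(weights[i], x))) for i in ids}


def hybrid_layer(
    x: List[float],
    weights: List[List[float]],
    gpu_ids: List[int],
    cpu_ids: List[int],
    predicted_active: List[int],
) -> Dict[int, float]:
    """Evaluate only predicted-active neurons, split by where each neuron lives.

    Both partitions run here in plain Python; a real engine would run the
    GPU partition as a device kernel and merge the two partial results.
    """
    active = set(predicted_active)
    gpu_part = sparse_neurons(x, weights, [i for i in gpu_ids if i in active])
    cpu_part = sparse_neurons(x, weights, [i for i in cpu_ids if i in active])
    return {**gpu_part, **cpu_part}


if __name__ == "__main__":
    counts = [90, 5, 70, 1, 40, 2]  # made-up profiling counts for 6 neurons
    gpu, cpu = place_neurons(counts, bytes_per_neuron=100, gpu_budget_bytes=300)
    print(gpu, cpu)  # [0, 2, 4] [1, 3, 5]

    weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0],
               [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
    out = hybrid_layer([1.0, -1.0], weights, gpu, cpu, predicted_active=[0, 3, 4])
    print(sorted(out.items()))  # [(0, 1.0), (3, 2.0), (4, 0.0)]
```

ReLU outputs zero for any neuron whose pre-activation is not positive. Skipping neurons that are predicted to be inactive therefore loses nothing when the prediction is correct. The hot/cold placement then decides which device handles the neurons that remain.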
Details
- Category: LLM Inference & Serving
- Price: Free
- Platform: Local/Desktop
- Difficulty: Advanced (4/5)
- License: MIT
- Minimum VRAM: 4 GB
- Added: Apr 3, 2026
Related Tools
- High-throughput LLM serving engine with PagedAttention
- Minimalist ML framework in Rust by Hugging Face for fast inference
- Optimized inference library for running quantized LLMs on consumer GPUs
- Open-source ChatGPT alternative that runs 100% offline on your computer
- Hugging Face's high-performance text generation server
- Port of Meta's LLaMA model in C/C++ for efficient CPU inference