PowerInfer

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open Source, Self Hosted, Offline Capable, GPU Required (4GB+ VRAM)

About

PowerInfer, from Shanghai Jiao Tong University, is a hybrid CPU/GPU inference engine that exploits activation locality in large language models: frequently activated ("hot") neurons are kept on the GPU, while rarely activated ("cold") neurons are computed on the CPU. This split speeds up inference on a single consumer GPU; the project reports up to an 11x speedup over llama.cpp for large ReLU-based models when VRAM is limited. Released under the MIT license.
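To make the hot/cold split concrete, here is a minimal sketch in Python. It is not PowerInfer's code or its actual placement policy: a simple greedy rule stands in for whatever the engine does, and the names `partition_neurons`, `gpu_hit_rate`, and `gpu_budget`, as well as the activation frequencies, are hypothetical values chosen for illustration.

```python
def partition_neurons(activation_freq, gpu_budget):
    """Split neuron indices into (hot, cold) lists by activation frequency.

    activation_freq: per-neuron fraction of inputs on which the neuron fires
                     (assumed to come from some offline profiling run).
    gpu_budget:      how many neurons fit in VRAM (hypothetical unit).
    """
    # Rank neuron indices from most to least frequently activated.
    ranked = sorted(range(len(activation_freq)),
                    key=lambda i: activation_freq[i], reverse=True)
    hot = sorted(ranked[:gpu_budget])    # kept on the GPU
    cold = sorted(ranked[gpu_budget:])   # left to the CPU
    return hot, cold


def gpu_hit_rate(activation_freq, hot):
    """Fraction of all neuron activations served by GPU-resident neurons."""
    total = sum(activation_freq)
    return sum(activation_freq[i] for i in hot) / total if total else 0.0


# Made-up frequencies: a few neurons fire often, most fire rarely.
freq = [0.90, 0.05, 0.80, 0.02, 0.01, 0.70, 0.03, 0.04]
hot, cold = partition_neurons(freq, gpu_budget=3)
print("hot:", hot)
print("cold:", cold)
print("GPU hit rate:", round(gpu_hit_rate(freq, hot), 3))
```

With a skewed distribution like this one, placing only three of the eight neurons on the GPU still covers most activations, which is the intuition behind running large models with limited VRAM.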

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Advanced (4/5)
License: MIT
Minimum VRAM: 4 GB
Added: Apr 3, 2026

Tags

inference, sparse, neuron-aware, consumer-gpu, fast, hybrid

Related Tools

vLLM

LLM Inference & Serving

High-throughput LLM serving engine with PagedAttention

Open Source, Self Hosted, Offline, GPU 16GB+

Intermediate

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open Source, Self Hosted, Offline

Advanced

ExLlamaV2

LLM Inference & Serving

Optimized inference library for running quantized LLMs on consumer GPUs.

Open Source, Self Hosted, Offline, GPU 6GB+

Intermediate

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open Source, Self Hosted, Offline

Beginner

Text Generation Inference

LLM Inference & Serving

Hugging Face's high-performance text generation server

Open Source, Self Hosted, Offline, GPU 16GB+

Advanced

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open Source, Self Hosted, Offline

Intermediate
