vLLM
High-throughput LLM serving engine with PagedAttention
About
vLLM is a library for high-throughput LLM inference and serving, originally developed at UC Berkeley's Sky Computing Lab. Its PagedAttention algorithm manages attention key and value memory efficiently, which raises throughput and reduces wasted memory. It supports more than 200 model architectures from Hugging Face, along with continuous batching, quantization, tensor parallelism, and an OpenAI-compatible server. It is Apache 2.0 licensed and maintained by a large contributor community.
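To make the memory-management idea concrete, here is a toy sketch of paging a key/value cache in fixed-size blocks. It is not vLLM's implementation and stores no tensors, only the bookkeeping. The class `PagedKVCache`, its methods, and the block sizes are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-table allocator (hypothetical; bookkeeping only, no tensors)."""
    num_blocks: int        # total fixed-size blocks in the pool (illustrative)
    block_size: int = 16   # token slots per block (illustrative value)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> list of block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        # Every block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Count reserved-but-unused token slots across all live sequences."""
        return sum(
            len(table) * self.block_size - self.seq_lens.get(seq_id, 0)
            for seq_id, table in self.block_tables.items()
        )


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens occupy 1 block
```

Because blocks are handed out on demand, each sequence leaves at most one partially filled block unused, rather than reserving space up front for its maximum possible length.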
Details
- Category: LLM Inference & Serving
- Price: Free
- Platform: Local/Desktop
- Difficulty: Intermediate (3/5)
- License: Apache-2.0
- Minimum VRAM: 16 GB
- Added: Jan 29, 2026
Related Tools
Port of Meta's LLaMA model in C/C++ for efficient CPU inference.
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Hugging Face's high-performance text generation server.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
Mentioned in
SGLang and the Structured-Output Renaissance
Constrained generation used to be a library you bolted on. It is becoming a feature of the inference engine....
Max P
Why Aphrodite Engine Is the Dark Horse of LLM Serving
Aphrodite Engine forks vLLM and adds the long tail of quantization formats and samplers that the...
Max P
Running Qwen3 Locally with vLLM on a Single 4090, Setup and Notes
A practical setup walkthrough for serving a Qwen3 variant locally with vLLM on a single 24GB consumer GPU,...
Billy C