ExLlamaV2
Optimized inference library for running quantized LLMs on consumer GPUs.
About
ExLlamaV2 is an inference library for running quantized language models on modern consumer NVIDIA GPUs, using custom CUDA kernels for speed. It supports GPTQ and the EXL2 format, dynamic batching, and speculative decoding, and its recommended server backend is TabbyAPI, which adds an OpenAI-compatible API. The repository is archived as development moves to ExLlamaV3. Released under the MIT license.
Reviews (0)
Leave a Review
No reviews yet. Be the first to review!
Details
- Category
- LLM Inference & Serving
- Price
- Free
- Platform
- Local/Desktop
- Difficulty
- Intermediate (3/5)
- License
- MIT
- Minimum VRAM
- 6 GB
- Added
- Apr 3, 2026
Related Tools
Port of Meta's LLaMA model in C/C++ for efficient CPU inference
High-throughput LLM serving engine with PagedAttention
Minimalist ML framework in Rust by Hugging Face for fast inference.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Hugging Face's high-performance text generation server
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.