ExLlamaV2

Optimized inference library for running quantized LLMs on consumer GPUs.

Open Source, Self Hosted, Offline Capable, GPU Required (6GB+ VRAM)

About

ExLlamaV2 is an inference library for running quantized language models on modern consumer NVIDIA GPUs, using custom CUDA kernels for speed. It supports the GPTQ and EXL2 quantization formats, dynamic batching, and speculative decoding. Its recommended server backend is TabbyAPI, which adds an OpenAI-compatible API. The repository is archived as development moves to ExLlamaV3. ExLlamaV2 is released under the MIT license.

Details

Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: MIT
Minimum VRAM: 6 GB
Added: Apr 3, 2026

Related Tools

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open Source, Self Hosted, Offline, Intermediate

High-throughput LLM serving engine with PagedAttention

Open Source, Self Hosted, Offline, GPU 16GB+, Intermediate

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open Source, Self Hosted, Offline, Advanced

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open Source, Self Hosted, Offline, Beginner

Hugging Face's high-performance text generation server

Open Source, Self Hosted, Offline, GPU 16GB+, Advanced

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open Source, Self Hosted, Offline, GPU 4GB+, Advanced