TabbyAPI
Fast ExLlamaV2-based OpenAI-compatible API server for quantized models.
About
TabbyAPI is a FastAPI application that serves an OpenAI-compatible API for generating text from quantized LLMs using the ExLlamaV2 and ExLlamaV3 backends, and it is the official API server for those projects. It runs EXL2 and GPTQ models efficiently on consumer GPUs with streaming and function-calling support. It is a hobby project aimed at small user counts rather than heavy production load. Released under the AGPL-3.0 license.
Reviews (0)
Leave a Review
No reviews yet. Be the first to review!
Details
- Category
- LLM Inference & Serving
- Price
- Free
- Platform
- Local/Desktop
- Difficulty
- Easy (2/5)
- License
- AGPL-3.0
- Minimum VRAM
- 6 GB
- Added
- Apr 3, 2026
Related Tools
Port of Meta's LLaMA model in C/C++ for efficient CPU inference
High-throughput LLM serving engine with PagedAttention
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.