Tools/LLM Inference & Serving/TabbyAPI

TabbyAPI

Fast ExLlamaV2-based OpenAI-compatible API server for quantized models.

Open SourceSelf HostedOffline CapableGPU Required (6GB+ VRAM)

0.0 (0)

Visit Website View on GitHub

About

TabbyAPI is a FastAPI application that serves an OpenAI-compatible API for generating text from quantized LLMs using the ExLlamaV2 and ExLlamaV3 backends, and it is the official API server for those projects. It runs EXL2 and GPTQ models efficiently on consumer GPUs with streaming and function-calling support. It is a hobby project aimed at small user counts rather than heavy production load. Released under the AGPL-3.0 license.

Reviews (0)

Leave a Review

No reviews yet. Be the first to review!

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Easy (2/5)
License: AGPL-3.0
Minimum VRAM: 6 GB
Added: Apr 3, 2026

Tags

inference api exllamav2 quantized openai-compatible fast

Related Tools

Featured

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

Featured

vLLM

LLM Inference & Serving

High-throughput LLM serving engine with PagedAttention

Open SourceSelf HostedOfflineGPU 16GB+

Intermediate

0.0 (0)

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open SourceSelf HostedOffline

Advanced

0.0 (0)

ExLlamaV2

LLM Inference & Serving

Optimized inference library for running quantized LLMs on consumer GPUs.

Open SourceSelf HostedOfflineGPU 6GB+

Intermediate

0.0 (0)

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

PowerInfer

LLM Inference & Serving

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open SourceSelf HostedOfflineGPU 4GB+

Advanced

0.0 (0)

Browse all LLM Inference & Serving tools