LLM Inference &amp; Serving AI Tools

Unified API to call 100+ LLM providers with OpenAI format

Open SourceSelf Hosted

Easy

0.0 (0)

Featured

LM Studio

Desktop application for discovering, downloading, and running local LLMs.

Self HostedOffline

Beginner

0.0 (0)

Featured

llama.cpp

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

Featured

vLLM

High-throughput LLM serving engine with PagedAttention

Open SourceSelf HostedOfflineGPU 16GB+

Intermediate

0.0 (0)

Featured

Ollama

Run large language models locally with a simple CLI interface

Open SourceSelf HostedOffline

Beginner

0.0 (0)

GPT4All

Open-source ecosystem for running LLMs locally on consumer hardware.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

Llamafile

Single-file executable LLMs by Mozilla that run on any OS without installation.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

LocalAI

Drop-in OpenAI-compatible API server for running LLMs, image, and audio models locally.

Open SourceSelf HostedOffline

Easy

0.0 (0)

TensorRT-LLM

NVIDIA toolkit for optimizing LLM inference on NVIDIA GPUs.

Open SourceSelf HostedOfflineGPU 8GB+

Advanced

0.0 (0)

Text Generation Inference (TGI)

Production-ready LLM serving toolkit by Hugging Face.

Open SourceSelf HostedOfflineGPU 8GB+

Intermediate

0.0 (0)

CTranslate2

Fast inference engine for Transformer models using custom C++ runtime.

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

MLC LLM

Universal LLM deployment engine for native apps on any hardware.

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

SGLang

Fast serving framework for LLMs with structured generation and RadixAttention.

Open SourceSelf HostedOfflineGPU 8GB+

Intermediate

0.0 (0)

Kobold.cpp

Easy-to-use local AI inference with built-in web UI and API.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

Petals

Run large language models collaboratively by distributing layers across users.

Open SourceSelf HostedGPU 4GB+

Intermediate

0.0 (0)

TabbyAPI

Fast ExLlamaV2-based OpenAI-compatible API server for quantized models.

Open SourceSelf HostedOfflineGPU 6GB+

Easy

0.0 (0)

llama-cpp-python

Python bindings for llama.cpp with OpenAI-compatible API server.

Open SourceSelf HostedOffline

Easy

0.0 (0)

Aphrodite Engine

High-performance LLM inference engine forked from vLLM with extra features.

Open SourceSelf HostedOfflineGPU 8GB+

Intermediate

0.0 (0)

Nitro

Lightweight inference engine for local AI with OpenAI-compatible API.

Open SourceSelf HostedOffline

Easy

0.0 (0)

LightLLM

Lightweight, scalable Python LLM inference and serving framework focused on high throughput.

Open SourceSelf HostedOfflineGPU

Intermediate

0.0 (0)

LMDeploy

Toolkit for compressing, deploying, and serving large language models with optimized inference.

Open SourceSelf HostedOfflineGPU

Intermediate

0.0 (0)

KTransformers

Heterogeneous CPU and GPU inference framework for very large language models on limited hardware.

Open SourceSelf HostedOfflineGPU 24GB+

Advanced

0.0 (0)

PowerInfer

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open SourceSelf HostedOfflineGPU 4GB+

Advanced

0.0 (0)

Candle

Minimalist machine learning framework for Rust focused on performance and serverless inference.

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

Candle

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open SourceSelf HostedOffline

Advanced

0.0 (0)

ExLlamaV2

Optimized inference library for running quantized LLMs on consumer GPUs.

Open SourceSelf HostedOfflineGPU 6GB+

Intermediate

0.0 (0)

Jan

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

Text Generation Inference

Hugging Face's high-performance text generation server

Open SourceSelf HostedOfflineGPU 16GB+

Advanced

0.0 (0)