LLM Inference & Serving Tools
Open-source tools and runtimes for running large language models locally or serving them via API endpoints.
Desktop application for discovering, downloading, and running local LLMs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Open-source ecosystem for running LLMs locally on consumer hardware.
Single-file executable LLMs by Mozilla that run on any OS without installation.
Drop-in OpenAI-compatible API server for running LLMs, image, and audio models locally (see the client sketch after this list).
NVIDIA toolkit for optimizing LLM inference on NVIDIA GPUs.
Production-ready LLM serving toolkit by Hugging Face.
Optimized inference library for running quantized LLMs on consumer GPUs.
Fast inference engine for Transformer models using a custom C++ runtime.
Universal LLM deployment engine for native apps on any hardware.
Fast serving framework for LLMs with structured generation and RadixAttention.
Easy-to-use local AI inference with a built-in web UI and API.
Distributed system for running large language models collaboratively by splitting layers across users.
Minimalist ML framework in Rust by Hugging Face for fast inference.
Fast OpenAI-compatible API server for quantized models, built on ExLlamaV2.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
Python bindings for llama.cpp with an OpenAI-compatible API server (see the library sketch after this list).
High-performance LLM inference engine forked from vLLM with extra features.
Lightweight inference engine for local AI with an OpenAI-compatible API.
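
Many of the servers above expose the same OpenAI-compatible HTTP interface, so a single client works against any of them. A minimal sketch using the official openai Python client; the base URL, port, and model name are placeholder assumptions, not defaults of any particular tool:

```python
from openai import OpenAI

# Assumption: a local OpenAI-compatible server is already running on this
# address; substitute the host, port, and model your server reports.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # most local servers ignore the API key
)

response = client.chat.completions.create(
    model="local-model",  # assumption: model name as configured server-side
    messages=[{"role": "user", "content": "Summarize RadixAttention in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the wire format matches OpenAI's, switching between these servers usually means changing only base_url and model.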
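For in-process use rather than a standalone server, the llama.cpp Python bindings can load a GGUF model directly. A minimal sketch; the model path is a placeholder for any GGUF file on disk:

```python
from llama_cpp import Llama

# Assumption: a GGUF model file exists at this path.
llm = Llama(model_path="./models/model.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```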