LLM Inference & Serving AI Tools
Open-source tools and runtimes for running large language models locally or serving them via API endpoints.
Open-source tools and runtimes for running large language models locally or serving them via API endpoints.
Desktop application for discovering, downloading, and running local LLMs.
Port of Meta's LLaMA model in C/C++ for efficient CPU inference
High-throughput LLM serving engine with PagedAttention
Run large language models locally with a simple CLI interface
Open-source ecosystem for running LLMs locally on consumer hardware.
Single-file executable LLMs by Mozilla that run on any OS without installation.
Drop-in OpenAI-compatible API server for running LLMs, image, and audio models locally.
NVIDIA toolkit for optimizing LLM inference on NVIDIA GPUs.
Production-ready LLM serving toolkit by Hugging Face.
Fast inference engine for Transformer models using custom C++ runtime.
Universal LLM deployment engine for native apps on any hardware.
Fast serving framework for LLMs with structured generation and RadixAttention.
Easy-to-use local AI inference with built-in web UI and API.
Run large language models collaboratively by distributing layers across users.
Fast ExLlamaV2-based OpenAI-compatible API server for quantized models.
Python bindings for llama.cpp with OpenAI-compatible API server.
High-performance LLM inference engine forked from vLLM with extra features.
Lightweight inference engine for local AI with OpenAI-compatible API.
Lightweight, scalable Python LLM inference and serving framework focused on high throughput.
Toolkit for compressing, deploying, and serving large language models with optimized inference.
Heterogeneous CPU and GPU inference framework for very large language models on limited hardware.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
Minimalist machine learning framework for Rust focused on performance and serverless inference.
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Hugging Face's high-performance text generation server