llama.cpp
Port of Meta's LLaMA model in C/C++ for efficient CPU inference
About
llama.cpp by Georgi Gerganov is a C and C++ inference engine for LLaMA-family and many other transformer language models, designed to run with minimal setup on a wide range of hardware including CPU-only laptops. It supports the GGUF quantized model format, multiple backends (CUDA, Metal, Vulkan, ROCm, BLAS), a server with an OpenAI-compatible API, and bindings for many languages. MIT licensed; the substrate for much of the local LLM ecosystem.
Reviews (0)
Leave a Review
No reviews yet. Be the first to review!
Details
- Category
- LLM Inference & Serving
- Price
- Free
- Platform
- Local/Desktop
- Difficulty
- Intermediate (3/5)
- License
- MIT
- Added
- Jan 29, 2026
Related Tools
High-throughput LLM serving engine with PagedAttention
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Hugging Face's high-performance text generation server
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
Mentioned in
Fine-Tuning Llama 3.3 with Unsloth on a 16GB GPU, Step-by-Step
A practical, end-to-end fine-tuning walkthrough with Unsloth: dataset prep, LoRA config, 4-bit quantization,...
Billy C
The State of Open-Source LLM Inference Engines in 2026
A survey of where the major open-source LLM inference engines stand: vLLM, llama.cpp, Aphrodite, SGLang,...
Max P