llm-inferencevllmllama-cppsglangself-hosting

The State of Open-Source LLM Inference Engines in 2026

Max P

The State of Open-Source LLM Inference Engines in 2026

Self-hosted LLM serving has matured a lot in the last two years. There is no single "best" engine anymore. There is a small set of mature open-source projects, each with a clear personality. This post is a tour of the ones I see in production today: vLLM, llama.cpp, Aphrodite, SGLang, LMDeploy, and LightLLM.

I will not throw made-up tokens-per-second figures at you. The honest answer is "it depends on the model, the hardware, and the workload." What matters is which engine is the right shape for your problem.

What changed since the early days

A few years ago, "serving an LLM" meant either a Hugging Face transformers script or a quick wrapper around a vendor API. Today, the open-source side has converged on a set of shared techniques:

  • KV cache management, usually some flavor of paged attention.
  • Continuous batching of incoming requests instead of static batches.
  • Quantization at multiple precisions, including INT8, INT4, FP8, and GGUF.
  • An OpenAI-compatible HTTP API, because everyone's client code already speaks that protocol.

What differs between engines is the target hardware, the depth of features around scheduling and structured output, and the priorities of the maintaining team. If you are deciding between any of these, that is what to focus on.

For a broader view of where the developer ecosystem is, my colleague's piece on The State of AI Developer Tools 2026 is a good companion read.

vLLM: the throughput-oriented default

vLLM was the project that made paged attention famous. It is Apache-2.0 licensed, originated at UC Berkeley's Sky Computing Lab, and is now maintained by a large open-source community.

The README is unambiguous about what it optimizes for: "efficient management of attention key and value memory with PagedAttention," continuous batching, speculative decoding, and a wide quantization matrix that includes "FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF" and others. Hardware support spans NVIDIA and AMD GPUs, x86, ARM, and PowerPC CPUs, plus TPUs, Intel Gaudi, and Apple Silicon.

When I see a team standing up a serving layer for a chatbot or RAG system on NVIDIA hardware, vLLM is almost always the default choice. The OpenAI-compatible server makes it a drop-in for almost any client library. See the vLLM repo for the full feature matrix.

llama.cpp: the everywhere engine

llama.cpp is the other heavyweight, but the personality is completely different. It is MIT-licensed, written in plain C and C++ with no dependencies, and built around the GGUF model format. The whole point is to run on whatever you have.

That includes ARM NEON and AVX/AVX2/AVX512 on CPUs, NVIDIA CUDA, AMD HIP, Apple Metal, Intel SYCL, plus Vulkan, OpenCL, and WebGPU. It even supports CPU plus GPU hybrid inference for models that exceed your VRAM. Quantization runs from 1.5-bit through 8-bit integer formats.

llama.cpp is what you reach for when the question is "I have this laptop or this slightly weird server, can it run a model?" The answer is almost always yes. If you want absolute throughput on a fleet of H100s, you would not pick llama.cpp. If you want a Mac mini, a Raspberry Pi, or a dusty Xeon to serve a model, this is the project.

Aphrodite: a fork with serving in mind

Aphrodite Engine sits in an interesting place. According to its repository, it is "Built on vLLM's Paged Attention technology" and is licensed AGPL-3.0. It powers PygmalionAI's chat platforms and emphasizes "high-performance model inference for multiple concurrent users."

Where Aphrodite differs from upstream vLLM is breadth of quantization formats and sampling features. It lists support for AQLM, AutoRound, AWQ, BitNet, Bitsandbytes, ExLlamaV3, GGUF, GPTQ, QuIP#, SqueezeLLM, Marlin, plus quantized KV cache. On the sampling side it ships DRY, XTC, and Mirostat.

If your workload is an interactive chat or roleplay product where you want a wide range of quantized open-weights models and richer sampler controls, Aphrodite is a natural pick. The AGPL license is the catch; check it carefully if you embed Aphrodite into a closed product.

SGLang: structured outputs and prefix caching

SGLang is Apache-2.0 licensed and pitches itself as a high-performance serving framework for "low-latency and high-throughput inference for large language models and multimodal models." Its signature feature is RadixAttention, a prefix-caching scheme that reuses key-value pairs across requests with shared prefixes.

For workloads with heavy prefix overlap, like agents that share long system prompts, this is a real win. SGLang also ships structured outputs, FP4, FP8, INT4, AWQ, and GPTQ quantization, and a zero-overhead CPU scheduler. It supports NVIDIA, AMD, Intel Xeon CPUs, Google TPUs, and other accelerators per the README.

The framework's adoption tells its own story: deployed at xAI, AMD, NVIDIA, Intel, and major cloud providers. If you are building a system where every request shares a long prefix, SGLang is worth benchmarking against vLLM specifically.

LMDeploy: a serving toolkit with strong quantization

LMDeploy describes itself as "a toolkit for compressing, deploying, and serving LLM," developed by the MMRazor and MMDeploy teams. Apache-2.0. It pairs two backends: TurboMind, which is the highly optimized C plus CUDA path, and a PyTorch path for easier experimentation.

LMDeploy emphasizes weight-only and KV quantization with a focus on 4-bit inference, persistent batching, blocked KV cache, and multi-machine, multi-GPU distribution. Hardware support includes NVIDIA GPUs (CUDA 12+), AMD ROCm, Intel GPUs, Huawei Ascend, and Mac processors. It supports 60-plus language models and 40-plus vision-language models including the InternVL, LLaVA, and Qwen-VL families.

If you are serving large vision-language models at scale on NVIDIA hardware, LMDeploy and SGLang are both worth a hard look against vLLM.

LightLLM: the lightweight, research-friendly option

LightLLM is Apache-2.0 and described in its README as "a Python-based LLM inference and serving framework" with a "lightweight design, easy scalability, and high-speed performance." It uses token-level KV cache management and supports constrained decoding with deterministic pushdown automata, plus prefix KV cache transfer between distributed ranks.

The pure-Python design makes it appealing for research where you want to instrument the scheduler or the cache. For production at huge scale you would more often see vLLM or SGLang. For a small team doing serving research, LightLLM is a reasonable starting point.

How to choose

A rough decision tree:

  • Mainstream NVIDIA serving with maximum throughput and ecosystem support: vLLM.
  • Run anywhere, especially CPU and Apple Silicon, prefer GGUF: llama.cpp.
  • Heavy prefix sharing or strong structured-output needs: SGLang.
  • Vision-language at scale on NVIDIA: LMDeploy.
  • Wide quantization matrix and sampler richness for a chat product: Aphrodite (mind AGPL).
  • Research on serving systems in pure Python: LightLLM.

You will likely run more than one. A small team I work with serves a 4-bit GGUF model with llama.cpp for a desktop app and vLLM for the cloud API, sharing a single OpenAI-compatible client. The category is mature enough now to mix and match.

Tools mentioned in this post

  • vLLM: high-throughput Apache-2.0 inference engine built around PagedAttention with broad hardware and quantization support.
  • llama.cpp: MIT-licensed C/C++ inference engine built around GGUF, runs on practically any hardware.
  • Aphrodite Engine: AGPL-3.0 inference engine built on vLLM's PagedAttention with extended quantization and sampler support.
  • SGLang: Apache-2.0 serving framework with RadixAttention prefix caching and strong structured-output support.
  • LMDeploy: Apache-2.0 toolkit with TurboMind and PyTorch backends, strong 4-bit and KV quantization, broad VLM support.

Related Tools

More Articles