SGLang
Fast serving framework for LLMs with structured generation and RadixAttention.
About
SGLang by LMSYS is a serving framework for large language and multimodal models that delivers low-latency, high-throughput inference on anything from a single GPU to large clusters. Its RadixAttention mechanism reuses the key-value (KV) cache across requests that share a prefix, and it supports constrained decoding for JSON and regex output, along with a frontend language for scripting multi-step model calls. It is released under the Apache 2.0 license.
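To make the cache-reuse idea concrete, here is a minimal sketch of prefix matching over token IDs. It is a toy illustration, not SGLang's implementation: the `PrefixCache` class, its methods, and the token values are all hypothetical, and a plain per-token trie stands in for a real radix tree (which would compress chains of nodes into single edges and store actual KV tensors).

```python
class PrefixCache:
    """Toy per-token trie (hypothetical; stands in for a radix tree)."""

    def __init__(self):
        # Each node is a dict mapping a token ID to its child node.
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()

# Two requests sharing the same (made-up) system-prompt tokens.
system_prompt = [1, 2, 3, 4]
req_a = system_prompt + [10, 11]
req_b = system_prompt + [20, 21, 22]

reused_a = cache.match_prefix(req_a)  # nothing cached yet
cache.insert(req_a)
reused_b = cache.match_prefix(req_b)  # shared system prompt is found

print(f"request A reused {reused_a} of {len(req_a)} tokens")
print(f"request B reused {reused_b} of {len(req_b)} tokens")
```

In this sketch the second request only needs fresh computation for the tokens after the shared prefix, which is the intuition behind reusing KV cache across requests.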
Details
- Category: LLM Inference & Serving
- Price: Free
- Platform: Local/Desktop
- Difficulty: Intermediate (3/5)
- License: Apache-2.0
- Minimum VRAM: 8 GB
- Added: Apr 3, 2026
Related Tools
Port of Meta's LLaMA model in C/C++ for efficient CPU inference.
High-throughput LLM serving engine with PagedAttention.
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.
Mentioned in
SGLang and the Structured-Output Renaissance
Constrained generation used to be a library you bolted on. It is becoming a feature of the inference engine....
Max P
Why Aphrodite Engine Is the Dark Horse of LLM Serving
Aphrodite Engine forks vLLM and adds the long tail of quantization formats and samplers that the...
Max P
The State of Open-Source LLM Inference Engines in 2026
A survey of where the major open-source LLM inference engines stand: vLLM, llama.cpp, Aphrodite, SGLang,...
Max P