Tools/LLM Inference & Serving/SGLang

SGLang

Fast serving framework for LLMs with structured generation and RadixAttention.

Open SourceSelf HostedOffline CapableGPU Required (8GB+ VRAM)

0.0 (0)

Visit Website View on GitHub

About

SGLang is a serving framework for large language models and vision-language models, developed under LMSYS, focused on low-latency, high-throughput inference. Its runtime combines RadixAttention prefix caching, continuous batching, paged attention, a zero-overhead CPU scheduler, prefill-decode disaggregation, and speculative decoding to scale from a single GPU to large multi-node clusters. Structured output is a distinguishing feature: constrained decoding driven by a compressed finite state machine produces JSON and regex-constrained text substantially faster than conventional approaches. The framework supports model families including Llama, Qwen, and DeepSeek along with embedding and diffusion models, exposes an OpenAI-compatible API, and offers FP8, FP4, and INT4 quantization. Hardware support spans NVIDIA and AMD GPUs, Intel Xeon CPUs, Google TPUs, and Ascend NPUs. The project reports production deployments across hundreds of thousands of GPUs at companies including xAI and major cloud providers, and the code is open source under the Apache 2.0 license.

Reviews (0)

Leave a Review

No reviews yet. Be the first to review!

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: Apache-2.0
Minimum VRAM: 8 GB
Added: Apr 3, 2026

Tags

inference serving structured json fast radix-attention

Related Tools

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open SourceSelf HostedOffline

Advanced

0.0 (0)

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

Featured

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

PowerInfer

LLM Inference & Serving

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open SourceSelf HostedOfflineGPU 4GB+

Advanced

0.0 (0)

Featured

vLLM

LLM Inference & Serving

High-throughput LLM serving engine with PagedAttention

Open SourceSelf HostedOfflineGPU 16GB+

Intermediate

0.0 (0)

Candle

LLM Inference & Serving

Minimalist machine learning framework for Rust focused on performance and serverless inference.

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

Browse all LLM Inference & Serving tools

Mentioned in

LLM Gateways in Production: LiteLLM vs Portkey vs Bifrost

A working comparison of LiteLLM, Portkey, and Bifrost on routing, caching, budgets, observability, and real...

Max P

Serving LLMs on Kubernetes: llm-d, AIBrix, and Dynamo

How llm-d, AIBrix, NVIDIA Dynamo, GPUStack, OpenLLM and Xinference actually differ on Kubernetes, and when a...

Billy C

SGLang and the Structured-Output Renaissance

Constrained generation used to be a library you bolted on. It is becoming a feature of the inference engine....

Max P