LightLLM

Lightweight, scalable Python LLM inference and serving framework focused on high throughput.

Open Source · Self Hosted · Offline Capable · GPU Required


About

ModelTC's LightLLM is a pure-Python inference and serving framework for large language models, built around a lightweight, easily modified codebase that still targets high throughput. Asynchronous request processing and token-level KV cache management keep GPU memory utilization high under load. The framework also provides constrained decoding for structured output, SLA-aware request scheduling, and prefix KV cache transfer across distributed ranks. Its design borrows proven techniques from FasterTransformer, TGI, vLLM, and FlashAttention, and the 1.0 release reported strong DeepSeek-R1 serving performance on a single H200 machine. Because the stack is Python end to end, research groups adapt it readily, and systems work built on LightLLM has appeared at venues including OSDI, SOSP, ASPLOS, and ACL. Typical users are organizations serving models at scale and researchers prototyping scheduling or caching ideas. LightLLM is released under the Apache 2.0 license.
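To give a feel for what token-level KV cache management means, here is a minimal sketch in plain Python. It is an illustration only and is not LightLLM's code or API: the class `TokenSlotAllocator` and its methods are hypothetical names, and the "cache" is just a pool of integer slot indices rather than GPU memory. The idea it shows is that each request borrows exactly one slot per token and returns them all on completion, so no memory is reserved for tokens that are never generated.

```python
from typing import Dict, List


class TokenSlotAllocator:
    """Toy token-granular KV cache bookkeeping (hypothetical, not LightLLM's API)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Free slot indices kept as a stack; reversed so pop() hands out 0 first.
        self._free: List[int] = list(range(capacity - 1, -1, -1))
        # Slots currently held by each request, in token order.
        self._owned: Dict[str, List[int]] = {}

    def free_slots(self) -> int:
        return len(self._free)

    def can_admit(self, num_tokens: int) -> bool:
        # A scheduler could use this check to decide whether a request fits.
        return num_tokens <= len(self._free)

    def allocate(self, request_id: str, num_tokens: int) -> List[int]:
        # Reserve one slot per token; fail without side effects if the pool is short.
        if not self.can_admit(num_tokens):
            raise MemoryError("not enough free KV cache slots")
        slots = [self._free.pop() for _ in range(num_tokens)]
        self._owned.setdefault(request_id, []).extend(slots)
        return slots

    def release(self, request_id: str) -> int:
        # Return every slot held by a finished request to the free pool.
        slots = self._owned.pop(request_id, [])
        self._free.extend(slots)
        return len(slots)


if __name__ == "__main__":
    alloc = TokenSlotAllocator(capacity=8)
    alloc.allocate("req-a", 5)   # prompt tokens
    alloc.allocate("req-a", 1)   # one decoded token
    alloc.allocate("req-b", 2)
    print(alloc.free_slots())    # pool is fully used
    alloc.release("req-a")
    print(alloc.free_slots())    # req-a's six slots are available again
```

A real serving framework would map these indices onto GPU tensors and combine the admission check with its scheduler; the sketch only captures the per-token bookkeeping.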