Tools/LLM Inference & Serving/Text Generation Inference (TGI)

Text Generation Inference (TGI)

Production-ready LLM serving toolkit by Hugging Face.

Open SourceSelf HostedOffline CapableGPU Required (8GB+ VRAM)

0.0 (0)

About

Text Generation Inference, better known as TGI, is Hugging Face's production toolkit for serving large language models, built in Rust and Python with gRPC internals and proven as the engine behind Hugging Chat and Hugging Face inference services. Its serving stack combines continuous batching, tensor parallelism across GPUs, Flash Attention and Paged Attention kernels, token streaming over Server-Sent Events, and structured or JSON-constrained generation, with quantization support spanning bitsandbytes, GPTQ, AWQ, Marlin, EETQ, and fp8. An OpenAI-compatible Messages API eases migration from hosted APIs, and OpenTelemetry tracing plus Prometheus metrics cover observability. Hardware support extends beyond NVIDIA to AMD, AWS Inferentia, Intel GPUs, Gaudi, and Google TPUs, serving architectures like Llama, Falcon, StarCoder, and BLOOM. The project is Apache 2.0 licensed and now in maintenance mode, accepting mainly bug fixes while Hugging Face points new deployments toward vLLM and SGLang, though many production systems still run TGI.