Tools/LLM Inference & Serving/Text Generation Inference

Text Generation Inference

Hugging Face's high-performance text generation server

Open SourceSelf HostedOffline CapableGPU Required (16GB+ VRAM)

0.0 (0)

Visit Website View on GitHub Documentation

About

Text Generation Inference by Hugging Face is a Rust, Python, and gRPC server for deploying and serving large language models, used in production to power Hugging Chat and the Inference API. It implements optimized generation for popular open models such as Llama, Falcon, StarCoder, and GPT-NeoX, with tensor parallelism, continuous batching, and an OpenAI-compatible messages API. Distributed as an official Docker container.

Reviews (0)

Leave a Review

No reviews yet. Be the first to review!

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Advanced (4/5)
License: Apache-2.0
Minimum VRAM: 16 GB
Added: Jan 29, 2026

Tags

llm inference huggingface serving

Related Tools

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open SourceSelf HostedOffline

Advanced

0.0 (0)

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open SourceSelf HostedOffline

Beginner

0.0 (0)

Featured

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

PowerInfer

LLM Inference & Serving

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open SourceSelf HostedOfflineGPU 4GB+

Advanced

0.0 (0)

Featured

vLLM

LLM Inference & Serving

High-throughput LLM serving engine with PagedAttention

Open SourceSelf HostedOfflineGPU 16GB+

Intermediate

0.0 (0)

Candle

LLM Inference & Serving

Minimalist machine learning framework for Rust focused on performance and serverless inference.

Open SourceSelf HostedOffline

Intermediate

0.0 (0)

Browse all LLM Inference & Serving tools