Featured Tool

vLLM

High-throughput LLM serving engine with PagedAttention

Open Source, Self Hosted, Offline Capable, GPU Required (16GB+ VRAM)

About

vLLM is a library for high-throughput LLM inference and serving, originally developed at UC Berkeley's Sky Computing Lab. Its PagedAttention algorithm manages attention key and value memory efficiently, which raises throughput and reduces wasted memory. It supports more than 200 model architectures from Hugging Face, along with continuous batching, quantization, tensor parallelism, and an OpenAI-compatible server. It is released under the Apache 2.0 license and maintained by a large contributor community.
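
To make the idea of paged key/value memory concrete, here is a toy sketch of block-based allocation in plain Python. It is an illustration under stated assumptions, not vLLM's implementation: the names `ToyBlockAllocator`, `ToySequence`, and `BLOCK_SIZE` are hypothetical, and the block size of 16 tokens is chosen only for the example. The point it shows is that a sequence claims fixed-size blocks on demand, so unused slots are confined to its last block instead of a large up-front reservation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class ToyBlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


@dataclass
class ToySequence:
    """Tracks one request's logical-to-physical block table."""
    allocator: ToyBlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Unused slots exist only in the final, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens

    def finish(self):
        # Return every block to the pool so other sequences can reuse them.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = ToyBlockAllocator(num_blocks=8)
seq = ToySequence(pool)
for _ in range(20):
    seq.append_token()
print(len(seq.block_table), seq.wasted_slots())  # 2 blocks, 12 unused slots
```

In this sketch a 20-token sequence occupies two blocks, and the blocks return to the shared pool as soon as the sequence finishes, which is what lets many requests share one memory pool.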

Details

Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: Apache-2.0
Minimum VRAM: 16 GB
Added: Jan 29, 2026

Related Tools

Featured

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open Source, Self Hosted, Offline
Intermediate

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open Source, Self Hosted, Offline
Advanced

Optimized inference library for running quantized LLMs on consumer GPUs.

Open Source, Self Hosted, Offline, GPU 6GB+
Intermediate

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open Source, Self Hosted, Offline
Beginner

Hugging Face's high-performance text generation server

Open Source, Self Hosted, Offline, GPU 16GB+
Advanced

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open Source, Self Hosted, Offline, GPU 4GB+
Advanced