Featured Tool

vLLM

High-throughput LLM serving engine with PagedAttention

Open Source, Self Hosted, Offline Capable, GPU Required (16GB+ VRAM)

About

vLLM is a library for high-throughput LLM inference and serving, originally developed at UC Berkeley's Sky Computing Lab. Its PagedAttention algorithm stores attention key and value (KV) memory in fixed-size blocks rather than one contiguous region per request, which raises throughput and reduces wasted memory. It supports more than 200 model architectures from Hugging Face, continuous batching, quantization, tensor parallelism, and an OpenAI-compatible server. It is released under the Apache 2.0 license and maintained by a large contributor community.
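To make the block-based memory idea concrete, here is a toy sketch of a paged KV-cache allocator. It is not vLLM's code or API: the class `ToyPagedKVCache`, its method names, and the pool and block sizes are all hypothetical, chosen only to show how a per-sequence block table lets a cache grow one block at a time, so that unused space is limited to the last, partly filled block of each sequence.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPagedKVCache:
    """Toy allocator (hypothetical, not vLLM's implementation).

    Tracks which fixed-size blocks hold each sequence's KV entries.
    """
    num_blocks: int        # size of the physical block pool (illustrative)
    block_size: int = 16   # tokens per block (illustrative)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        # Every block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Count token slots that are reserved but unused."""
        return sum(
            len(table) * self.block_size - self.seq_lens[seq_id]
            for seq_id, table in self.block_tables.items()
        )


cache = ToyPagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens need 1 block
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]))
print(cache.wasted_slots())
```

In this toy model the blocks assigned to a sequence do not need to be adjacent, and freeing a sequence returns its blocks to the pool for any other request to reuse.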

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: Apache-2.0
Minimum VRAM: 16 GB
Added: Jan 29, 2026

Tags

llm, inference, serving, high-throughput

Related Tools

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open Source, Self Hosted, Offline

Advanced

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open Source, Self Hosted, Offline

Beginner

Featured

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference.

Open Source, Self Hosted, Offline

Intermediate

PowerInfer

LLM Inference & Serving

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open Source, Self Hosted, Offline, GPU 4GB+

Advanced

Kobold.cpp

LLM Inference & Serving

Easy-to-use local AI inference with built-in web UI and API.

Open Source, Self Hosted, Offline

Beginner

Candle

LLM Inference & Serving

Minimalist machine learning framework for Rust focused on performance and serverless inference.

Open Source, Self Hosted, Offline

Intermediate

Mentioned in

LLM Gateways in Production: LiteLLM vs Portkey vs Bifrost

A working comparison of LiteLLM, Portkey, and Bifrost on routing, caching, budgets, observability, and real...

Max P

Serving LLMs on Kubernetes: llm-d, AIBrix, and Dynamo

How llm-d, AIBrix, NVIDIA Dynamo, GPUStack, OpenLLM and Xinference actually differ on Kubernetes, and when a...

Billy C

SGLang and the Structured-Output Renaissance

Constrained generation used to be a library you bolted on. It is becoming a feature of the inference engine....

Max P