LMDeploy

Toolkit for compressing, deploying, and serving large language models with optimized inference.

Open Source, Self Hosted, Offline Capable, GPU Required

About

LMDeploy comes from the InternLM team as a toolkit for compressing, deploying, and serving large language models. It ships two inference engines: TurboMind, a high-performance CUDA engine with persistent batching, blocked KV cache, and optimized kernels, and a PyTorch engine aimed at flexibility and easier experimentation. Quantization support covers weight-only quantization and KV cache quantization, and the project reports up to 1.8 times higher request throughput than vLLM, with 4-bit inference running about 2.4 times faster than FP16. Coverage is broad: more than 40 LLM architectures including Llama, Qwen, DeepSeek, Gemma, and Phi, plus over 30 vision-language models such as InternVL, LLaVA, and Qwen-VL, spanning roughly 1B to 754B parameters. Multi-machine multi-GPU deployment, tensor parallelism, and automatic prefix caching round out the serving features. The toolkit is open source under Apache 2.0 and is used by teams putting LLM and VLM inference into production.
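The serving features named above (tensor parallelism, KV cache quantization, prefix caching) are typically switched on through the engine configuration. The following is a minimal sketch, not an official recipe: it assumes the `pipeline` function and `TurbomindEngineConfig` class with the `tp`, `quant_policy`, and `enable_prefix_caching` parameters as described in the LMDeploy documentation, so check them against the installed release. The helper `engine_settings`, the function `run_demo`, the model ID, and the prompt are illustrative choices, not part of this listing.

```python
def engine_settings(num_gpus=1, kv_cache_bits=8, prefix_caching=True):
    """Hypothetical helper: collect keyword arguments for TurbomindEngineConfig."""
    if kv_cache_bits not in (0, 4, 8):
        raise ValueError("kv_cache_bits must be 0 (off), 4, or 8")
    return {
        "tp": num_gpus,                           # tensor-parallel degree (number of GPUs)
        "quant_policy": kv_cache_bits,            # 0 = unquantized KV cache, 4 or 8 = int4/int8
        "enable_prefix_caching": prefix_caching,  # reuse KV blocks for shared prompt prefixes
    }


def run_demo(model_id="internlm/internlm2_5-7b-chat"):
    """Assumed usage: needs lmdeploy installed and a CUDA GPU with enough memory."""
    # Imported lazily so engine_settings() works on machines without lmdeploy.
    from lmdeploy import pipeline, TurbomindEngineConfig

    backend = TurbomindEngineConfig(**engine_settings())
    pipe = pipeline(model_id, backend_config=backend)
    responses = pipe(["Summarize what tensor parallelism does."])
    return responses[0].text
```

Calling `run_demo()` downloads the model on first use and runs a single prompt through the TurboMind engine. Keeping the settings in a plain dictionary makes it easy to compare configurations, for example `engine_settings(num_gpus=2, kv_cache_bits=0)` for a two-GPU run without KV cache quantization.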

Details

Category: LLM Inference & Serving
Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: Apache-2.0
Added: May 7, 2026

Tags

inference llm serving quantization cuda

Related Tools

Candle

LLM Inference & Serving

Minimalist ML framework in Rust by Hugging Face for fast inference.

Open Source, Self Hosted, Offline

Advanced

Jan

LLM Inference & Serving

Open-source ChatGPT alternative that runs 100% offline on your computer.

Open Source, Self Hosted, Offline

Beginner

Featured

llama.cpp

LLM Inference & Serving

Port of Meta's LLaMA model in C/C++ for efficient CPU inference

Open Source, Self Hosted, Offline

Intermediate

PowerInfer

LLM Inference & Serving

Fast LLM inference on consumer GPUs using neuron-aware sparse computation.

Open Source, Self Hosted, Offline, GPU 4GB+

Advanced

Featured

vLLM

LLM Inference & Serving

High-throughput LLM serving engine with PagedAttention

Open Source, Self Hosted, Offline, GPU 16GB+

Intermediate

Candle

LLM Inference & Serving

Minimalist machine learning framework for Rust focused on performance and serverless inference.

Open Source, Self Hosted, Offline

Intermediate

Mentioned in

The State of Open-Source LLM Inference Engines in 2026

A survey of where the major open-source LLM inference engines stand: vLLM, llama.cpp, Aphrodite, SGLang,...

Max P