TensorRT-LLM

NVIDIA toolkit for optimizing LLM inference on NVIDIA GPUs.

Open Source, Self Hosted, Offline Capable, GPU Required (8GB+ VRAM)

About

TensorRT-LLM is NVIDIA's open-source library for maximizing inference performance of large language models and visual generation models on NVIDIA GPUs. It pairs specialized CUDA kernels for attention, matrix multiplication, and mixture-of-experts computation with a runtime that provides in-flight batching, a paged KV cache, speculative decoding, and prefill-decode disaggregation. Built on PyTorch and exposed through a high-level Python API, it supports quantization to FP8, FP4, and INT4 precision and scales from a single GPU to multi-node clusters using tensor, pipeline, and expert parallelism. Predefined configurations cover popular architectures including Llama, DeepSeek, and Mixtral, and the library integrates with NVIDIA Triton Inference Server and NVIDIA Dynamo for deployment. It targets data center GPUs such as the H100, H200, and B200 as well as consumer RTX cards. Released under the Apache 2.0 license, it is used by ML engineers and enterprises that need high-throughput, low-latency LLM serving on NVIDIA hardware.
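
To see why the quantization formats mentioned above matter for the listed VRAM requirement, the sketch below estimates weight storage as parameter count times bits per weight. It is plain arithmetic and does not call TensorRT-LLM. The function name `weight_memory_gib`, the FP16 baseline, and the 8-billion-parameter example are assumptions chosen for illustration. Real quantized checkpoints also carry scaling factors and may keep some layers at higher precision, so actual footprints will be somewhat larger.

```python
# Rough weight-memory estimate per precision (illustrative only).
# Assumption: every weight is stored at the stated bit width; quantization
# scales, the KV cache, and activations are ignored.

# Bits per weight; FP16 is included as an unquantized baseline.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4, "FP4": 4}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GiB for the given precision."""
    bits = BITS_PER_WEIGHT[precision]
    total_bytes = num_params * bits / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 8-billion-parameter model.
    for precision in BITS_PER_WEIGHT:
        gib = weight_memory_gib(8e9, precision)
        print(f"8B parameters at {precision}: {gib:.1f} GiB")
```

Under these assumptions, halving the bit width halves the weight footprint, which is what lets a model that is too large for a card at FP16 fit once quantized to FP8 or a 4-bit format. The KV cache and activations need additional memory on top of this figure.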

Related Tools

Candle

Jan

llama.cpp

PowerInfer

vLLM
