CTranslate2
Fast inference engine for Transformer models, built on a custom C++ runtime.
About
CTranslate2, developed by SYSTRAN, is a C++ and Python inference engine for Transformer models. Its custom runtime applies weight quantization, layer fusion, and batch reordering to cut latency and memory use on CPU and GPU. It supports int8, int16, and float16 precision and is used for translation, text generation, and speech recognition, including in projects such as faster-whisper. Released under the MIT license.
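To make the weight quantization mentioned above more concrete, here is a minimal pure-Python sketch of symmetric int8 quantization. It is an illustration of the general idea only, not CTranslate2's actual implementation; the function names `quantize_int8` and `dequantize_int8` and the per-tensor scaling scheme are assumptions made for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (hypothetical helper)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid division by zero when all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values (hypothetical helper)."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each restored weight differs from the original by at most half a quantization step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error bounded by half the scale. In CTranslate2 itself the precision is not hand-coded like this; it is selected through options such as `compute_type` when a model is loaded.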
Details
- Category: LLM Inference & Serving
- Price: Free
- Platform: Local/Desktop
- Difficulty: Intermediate (3/5)
- License: MIT
- Added: Apr 3, 2026
Related Tools
Port of Meta's LLaMA model in C/C++ for efficient CPU inference
High-throughput LLM serving engine with PagedAttention
Minimalist ML framework in Rust by Hugging Face for fast inference.
Optimized inference library for running quantized LLMs on consumer GPUs.
Open-source ChatGPT alternative that runs 100% offline on your computer.
Fast LLM inference on consumer GPUs using neuron-aware sparse computation.