CTranslate2

Fast inference engine for Transformer models, built on a custom C++ runtime.

Open Source, Self Hosted, Offline Capable

About

CTranslate2 is a C++ and Python library from the OpenNMT ecosystem for efficient inference with Transformer models. Instead of running models through a general-purpose deep learning framework, it executes them in a custom runtime that applies weight quantization, layer fusion, padding removal, batch reordering, in-place operations, and caching to cut latency and memory use on both CPU and GPU. Supported architectures span encoder-decoder models such as Transformer, NLLB, BART, T5, and Whisper; decoder-only models including GPT-2, Llama, Mistral, Gemma, and Qwen2; and encoder-only models such as BERT and XLM-RoBERTa. Converters are available for models trained with OpenNMT-py, OpenNMT-tf, Fairseq, Marian, and Hugging Face Transformers. Precision options cover FP16, BF16, INT16, INT8, and AWQ 4-bit quantization; with 8-bit weights, a model shrinks to roughly a quarter of its FP32 size on disk with minimal accuracy loss. Released under the MIT license, CTranslate2 powers production translation and transcription systems, most visibly as the engine behind faster-whisper.
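
To make the "roughly a quarter" figure concrete, here is a minimal sketch of symmetric 8-bit weight quantization in plain Python. It is a simplified illustration and not CTranslate2's own code: the function names, the toy weight matrix, and the choice of one scale per row stored as a 4-byte float are all assumptions made for this example.

```python
from array import array


def quantize_row_int8(row):
    """Symmetric INT8 quantization of one weight row (illustrative only).

    Returns the quantized values and the scale used to produce them.
    """
    max_abs = max(abs(x) for x in row)
    # Map the largest magnitude in the row to 127; an all-zero row keeps scale 1.0.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    values = array("b", (max(-127, min(127, round(x * scale))) for x in row))
    return values, scale


def dequantize_row(values, scale):
    """Recover approximate float weights from INT8 values and their scale."""
    return [v / scale for v in values]


# Hypothetical 8 x 64 weight matrix with values between -0.1 and 0.1.
weights = [
    [0.02 * ((i * 7 + j * 3) % 11 - 5) for j in range(64)]
    for i in range(8)
]

quantized = [quantize_row_int8(row) for row in weights]

# FP32 storage: 4 bytes per weight.
fp32_bytes = sum(len(row) for row in weights) * 4
# INT8 storage: 1 byte per weight plus an assumed 4-byte scale per row.
int8_bytes = sum(len(values) * values.itemsize + 4 for values, _ in quantized)
ratio = int8_bytes / fp32_bytes

# Largest absolute difference between original and dequantized weights.
max_error = max(
    abs(original - restored)
    for row, (values, scale) in zip(weights, quantized)
    for original, restored in zip(row, dequantize_row(values, scale))
)

print(f"size ratio: {ratio:.3f}, max round-trip error: {max_error:.6f}")
```

Each weight drops from 4 bytes to 1, and the per-row scales add a small overhead, so the ratio lands just above one quarter. The round-trip error stays within half a quantization step, which is the intuition behind the "minimal accuracy loss" claim; actual accuracy impact depends on the model and task.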

Related Tools

Candle

Jan

llama.cpp

PowerInfer

vLLM
