Parler-TTS

Text-to-speech model whose voice is controlled by a plain-text description of the desired speaker.

Open Source, Self Hosted, Offline Capable, GPU Required (6GB+ VRAM)


About

Rather than offering a fixed set of voices, Parler-TTS generates speech in a voice specified by a plain-language description of the speaker, for example a female voice with a warm tone speaking quickly. Maintained by Hugging Face, the project reproduces research by Lyth and King (Stability AI and the University of Edinburgh) on natural language guidance of text-to-speech, and it is fully open: training code, datasets, and model weights are all released under Apache 2.0. Two checkpoints trained on 45,000 hours of audiobook audio shipped in August 2024: Mini at 880M parameters and Large at 2.3B. The release also includes 34 named speakers, so a description that mentions a speaker by name reproduces the same voice across generations. Descriptions can specify gender, pitch, speaking rate, and even recording quality, while punctuation in the input text shapes prosody. Inference can be sped up with SDPA, Flash Attention 2, and model compilation. Researchers and developers use it as a controllable, reproducible alternative to closed text-to-speech APIs.
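
To make the description-driven workflow concrete, here is a minimal Python sketch. It assumes the `parler_tts`, `transformers`, `torch`, and `soundfile` packages are installed and that the `parler-tts/parler-tts-mini-v1` checkpoint can be downloaded. The helper `build_description`, its parameters, and its sentence template are hypothetical conveniences for this example and are not part of the library. The `synthesize` function follows the usage pattern from the project's README as best understood here, so check it against the current documentation before relying on it.

```python
def build_description(tone, pace, quality="very clear audio",
                      gender="female", speaker=None):
    """Compose a voice description (hypothetical helper, not a library API).

    If `speaker` is given, the description names that speaker so the same
    voice can be requested across generations; otherwise it uses `gender`.
    """
    subject = f"{speaker}'s voice" if speaker else f"A {gender} speaker's voice"
    return f"{subject} is {tone} and delivered at a {pace} pace, with {quality}."


def synthesize(text, description, out_path="parler_tts_out.wav",
               model_id="parler-tts/parler-tts-mini-v1"):
    """Generate speech for `text` in the voice described by `description`."""
    # Heavy imports are kept local so the helper above works without them.
    import soundfile as sf
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The description conditions the voice; the prompt is the text to speak.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids,
                                prompt_input_ids=prompt_input_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write(out_path, audio, model.config.sampling_rate)
    return out_path


if __name__ == "__main__":
    description = build_description(tone="warm", pace="fast")
    # Commas in the text add small pauses, one way punctuation shapes prosody.
    synthesize("Hey, how are you doing today?", description)
```

Passing `speaker="Jon"` to `build_description` (a name believed to be among the 34 released speakers) is one way to ask for a consistent voice. The exact wording of effective descriptions is a matter of experimentation.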