WhisperSpeech

Text-to-speech system built on top of Whisper encoder representations.

Open Source, Self Hosted, Offline Capable, GPU Required (6GB+ VRAM)

About

WhisperSpeech is an open source text-to-speech system from Collabora built by inverting Whisper: where OpenAI's model maps audio to text, WhisperSpeech runs the pipeline in reverse to generate natural speech. The architecture follows Google's SPEAR-TTS design, using semantic tokens derived from the Whisper encoder, acoustic tokens from Meta's EnCodec codec, and the Vocos vocoder for final audio. Training uses only properly licensed open speech data, starting with the English LibriLight corpus, which keeps the released models safe to build on under the MIT license. The system supports voice cloning from a short reference clip, and the team reports inference running more than ten times faster than real time on a consumer GPU. It installs from PyPI as the whisperspeech package, runs on PyTorch with CUDA, and ships Colab notebooks plus a hosted Hugging Face Space for quick testing. English is the primary language today, with multilingual support an explicit roadmap goal.
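A minimal sketch of trying the package locally and checking the speed claim on your own hardware. It assumes the package was installed with `pip install whisperspeech` and that the `Pipeline` class and its `generate_to_file` method behave as in the project's README at the time of writing; verify both against the version you install. The helper names `real_time_factor` and `synthesize` are hypothetical and exist only for this illustration.

```python
import time


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute.

    A value above 1.0 means generation is faster than real time.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def synthesize(text: str, out_path: str = "output.wav", speaker=None) -> float:
    """Generate speech to a file and return the wall-clock seconds it took.

    Assumption: whisperspeech.pipeline.Pipeline and generate_to_file exist
    with these arguments, as shown in the project's README. `speaker` is an
    optional reference clip (path or URL) used for voice cloning.
    """
    # Imported lazily so this module loads even without the package or a GPU.
    from whisperspeech.pipeline import Pipeline

    pipe = Pipeline()  # assumed to download the default models on first use
    start = time.perf_counter()
    pipe.generate_to_file(out_path, text, speaker=speaker)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Pure arithmetic, no GPU needed: 30 s of audio generated in 2.5 s.
    print(real_time_factor(30.0, 2.5))
```

To measure a real run, call `synthesize(...)`, read the duration of the resulting audio file with your preferred audio tool, and pass both numbers to `real_time_factor`. The first call includes model download and warm-up time, so time a second call for a fairer figure.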