OuteTTS

A pure language modeling approach to TTS, without a separate acoustic model and vocoder pipeline.

Open Source, Self Hosted, Offline Capable, GPU Required (4GB+ VRAM)


About

OuteTTS takes an unusual approach to speech synthesis, treating it purely as a language modeling problem: audio is produced through next-token prediction rather than a conventional pipeline of separate acoustic models and vocoders. The project, from OuteAI, ships models at 0.6B and 1B parameters across several releases up to version 1.0. It supports voice cloning through custom speaker profiles, which are created from reference audio in a few lines of code, and it generates up to roughly 42 seconds of audio per run. It is distributed as both a Python package on PyPI and a JavaScript package on npm, and it runs on several backends, including llama.cpp, Hugging Face Transformers, ExLlamaV2, vLLM, and Transformers.js, with acceleration on CUDA, ROCm, Apple Metal, and Vulkan hardware. Because the models behave like ordinary language models, they also load in third-party runtimes such as KoboldCPP. The lightweight footprint and broad backend support appeal to developers who want local, open text-to-speech without heavyweight dependencies.
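To make the "speech as next-token prediction" idea concrete, here is a minimal toy sketch of an autoregressive loop that emits audio tokens one at a time after a text prompt. It is not OuteTTS code and does not use the OuteTTS API: the special tokens `<audio_start>` and `<audio_end>`, the `<code_N>` token names, and the functions `toy_next_token` and `generate_audio_tokens` are all hypothetical, and the rule-based "model" stands in for a trained language model. Turning the resulting tokens into a waveform is outside the scope of the sketch.

```python
from typing import Callable, List

# Hypothetical special tokens for this toy example (not OuteTTS's real vocabulary).
AUDIO_START = "<audio_start>"
AUDIO_END = "<audio_end>"


def toy_next_token(context: List[str]) -> str:
    """Stand-in for a language model: return the next token given the context.

    A real model would score every vocabulary entry. This toy rule emits one
    fake audio code per prompt word and then signals the end of the audio.
    """
    start = context.index(AUDIO_START)
    words = context[:start]          # the text prompt
    emitted = context[start + 1:]    # audio tokens generated so far
    if len(emitted) < len(words):
        word = words[len(emitted)]
        return f"<code_{len(word)}>"  # fake audio code derived from word length
    return AUDIO_END


def generate_audio_tokens(
    text: str,
    next_token: Callable[[List[str]], str] = toy_next_token,
    max_new_tokens: int = 64,
) -> List[str]:
    """Autoregressive loop: append one predicted token at a time.

    Generation stops at the end token or when the token budget is exhausted.
    """
    context = text.split() + [AUDIO_START]
    audio_tokens: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == AUDIO_END:
            break
        context.append(token)  # the new token becomes part of the next input
        audio_tokens.append(token)
    return audio_tokens


if __name__ == "__main__":
    print(generate_audio_tokens("hello local speech"))
```

The `max_new_tokens` budget in this sketch plays the same role as a per-run generation limit: once the budget is spent, the loop stops regardless of how much text remains, which is one plausible reason a model of this kind caps the length of a single clip.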