VALL-E X

Cross-lingual neural codec language model for speech synthesis.

Open Source · Self Hosted · Offline Capable · GPU Required (8GB+ VRAM)

About

When Microsoft published the VALL-E X paper without releasing code or weights, this project reverse-engineered the system and trained an open version from scratch. VALL-E X treats speech synthesis as language modeling: a GPT-style transformer predicts EnCodec-quantized audio tokens from text, and Vocos decodes those tokens to waveforms. The result is zero-shot voice cloning from a 3-to-10-second sample. The model supports English, Mandarin Chinese, and Japanese, offers cross-lingual synthesis that keeps a speaker's voice across languages, gives control over emotion and accent, and even preserves the acoustic environment of the prompt recording. The transformer's context caps generation at about 22 seconds per pass. The project needs Python 3.10, PyTorch 2.0, and roughly 6 GB of GPU VRAM, and the maintainers describe it as smaller and faster than Bark, with stronger Chinese and Japanese quality. Code and the trained checkpoint are MIT licensed, which draws researchers and hobbyists who want multilingual voice cloning they can run and modify locally.
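Because each pass is capped at about 22 seconds, longer scripts have to be fed to the model in pieces. The sketch below shows one way that pre-splitting could look. It is not the project's own code: the helper names `estimate_seconds` and `split_for_generation` are hypothetical, and the speaking rate of 2.5 words per second is an assumed rough figure for English, not a value taken from VALL-E X.

```python
import re

MAX_SECONDS = 22.0      # per-pass cap quoted in the description above
WORDS_PER_SECOND = 2.5  # ASSUMPTION: rough English speaking rate, not a project value


def estimate_seconds(text: str) -> float:
    """Crude duration estimate from the whitespace-separated word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_for_generation(text: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Greedily pack whole sentences into chunks whose estimate stays under the cap."""
    # Split after sentence-ending punctuation, keeping the punctuation attached.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would exceed the cap: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be synthesized in its own pass and the resulting waveforms concatenated. Two limits of this sketch are worth noting: a single sentence that is longer than the cap is kept whole rather than cut mid-sentence, and the word-count estimate only makes sense for space-delimited text, so Chinese or Japanese input would need a character-based estimate instead.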