MetaVoice

Real-time voice cloning and TTS model with 1.2B parameters by MetaVoice.

Open Source · Self Hosted · Offline Capable · GPU Required (12GB+ VRAM)


About

MetaVoice-1B is a 1.2 billion parameter text-to-speech base model trained on 100,000 hours of speech, released by MetaVoice under the Apache 2.0 license with no usage restrictions. Its focus is voice cloning and expressive delivery: zero-shot cloning works for American and British accents from about 30 seconds of reference audio, while cross-lingual or other-accent cloning is achieved through fine-tuning, reportedly with as little as one minute of data for Indian-accented speakers. The model handles emotional rhythm and tone in English and supports synthesis of arbitrarily long text. Running it requires a GPU with at least 12 GB of VRAM and Python 3.10 or 3.11, and on Ampere or newer NVIDIA architectures the compiled synthesis path reaches faster than real-time generation. The repository includes fine-tuning scripts driven by simple CSV datasets of audio and captions. It suits developers who need a permissively licensed, cloning-capable TTS model they can host themselves.
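As a rough illustration of the CSV-driven fine-tuning input mentioned above, the sketch below pairs each audio clip with a same-named transcript file and writes a two-column manifest. The helper name `build_manifest`, the `|` delimiter, and the `audio_files` / `captions` column names are assumptions made for this example; check the repository's fine-tuning instructions for the exact format before relying on it.

```python
import csv
from pathlib import Path


def build_manifest(data_dir, out_csv, delimiter="|"):
    """Pair each .wav file with a same-named .txt caption and write a manifest.

    Hypothetical helper: the delimiter and header names are assumptions,
    not a confirmed MetaVoice format.
    Returns the number of data rows written (header excluded).
    """
    data_dir = Path(data_dir)
    rows = []
    for wav in sorted(data_dir.glob("*.wav")):
        caption = wav.with_suffix(".txt")
        if caption.exists():  # skip clips that have no transcript
            rows.append((str(wav), str(caption)))

    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(["audio_files", "captions"])  # assumed header names
        writer.writerows(rows)
    return len(rows)
```

Calling `build_manifest("data", "train.csv")` on a folder of matching `.wav` and `.txt` files would produce one row per pair. Clips without a transcript are skipped rather than written with an empty caption, so the manifest never points at a missing file.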