CosyVoice 2

Large-scale multilingual TTS model by Alibaba with zero-shot voice cloning.

Open Source · Self Hosted · Offline Capable · GPU Required (8GB+ VRAM)

About

The second generation of the CosyVoice line, CosyVoice 2 is a 0.5B-parameter speech synthesis model developed by the FunAudioLLM team at Alibaba. Its headline addition over the original 300M model is streaming: the model supports bidirectional streaming synthesis with first-packet latency reported as low as 150 ms while keeping quality close to offline generation, which makes it practical for live voice agents. Like the rest of the family, it performs zero-shot voice cloning from a brief audio prompt, cross-lingual synthesis in which a cloned voice speaks another language, and instruction-controlled generation covering emotion, dialect, speaking rate, and volume. The model handles Chinese, English, Japanese, and Korean along with Chinese dialects, and the repository provides training and inference code, a web UI, Docker deployment, and weights on ModelScope and Hugging Face under the Apache 2.0 license. Within the same repository it has since been followed by Fun-CosyVoice 3.0, which broadens language coverage further.
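First-packet latency is the figure that matters most for live voice agents, so it can help to measure it on your own hardware. The sketch below shows one way to time any streaming synthesizer that yields audio chunks. It does not call CosyVoice itself: `fake_streaming_tts` is a hypothetical stand-in that emits placeholder bytes, and `measure_first_packet_latency` is an illustrative helper, not part of the CosyVoice repository. To use it with a real model, you would pass that model's streaming generator in place of the stub.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_chars: int = 8,
                       delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer.

    Yields one dummy audio chunk per slice of `chunk_chars` characters,
    sleeping briefly to simulate per-chunk compute. Not real audio.
    """
    for _ in range(0, len(text), chunk_chars):
        time.sleep(delay_s)
        yield b"\x00" * 320  # placeholder PCM bytes


def measure_first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk stream and return
    (first_packet_latency_s, total_time_s, n_chunks)."""
    start = time.perf_counter()
    first_latency = None
    n_chunks = 0
    for _ in chunks:
        if first_latency is None:
            # Time from starting iteration to receiving the first chunk.
            first_latency = time.perf_counter() - start
        n_chunks += 1
    total = time.perf_counter() - start
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, total, n_chunks


if __name__ == "__main__":
    text = "Streaming synthesis emits audio as it goes."
    first, total, n = measure_first_packet_latency(fake_streaming_tts(text))
    print(f"first packet: {first * 1000:.1f} ms, "
          f"total: {total * 1000:.1f} ms, chunks: {n}")
```

Because Python generators are lazy, the timer starts before any synthesis work runs, so the first value reflects the wait a listener would experience before audio begins. The numbers printed by the stub are meaningless in themselves; only measurements against a real model on the target GPU can be compared with the reported 150 ms figure.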