CosyVoice

Multilingual large voice generation model with full-stack inference, training, and deployment.

Open Source, Self Hosted, Offline Capable, GPU Required

About

CosyVoice is a family of large text-to-speech models from the FunAudioLLM team that treats speech synthesis as a language modeling problem. The current Fun-CosyVoice 3.0 release, a 0.5B parameter model, covers nine languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian) plus more than 18 Chinese dialects and accents. It supports zero-shot voice cloning from a short reference sample, including cross-lingual cloning. Streaming works on both sides: text can be streamed in while audio is streamed out, with latency as low as about 150 ms. Instruction inputs control language, dialect, emotion, speaking rate, and volume. Pronunciation inpainting accepts Chinese Pinyin and English CMU phonemes, so the reading of a specific word can be pinned down explicitly. Reported benchmarks include a 0.81 percent character error rate on Chinese test sets and a 1.68 percent word error rate on English test sets. The open source project ships a web UI, Docker images, and a Python API, with weights on ModelScope and Hugging Face, and is aimed at developers building production voice features.
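For readers unfamiliar with the two benchmark metrics quoted above, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly defined: the edit distance between a reference text and a transcript, divided by the reference length. It is not taken from the CosyVoice project. The function names are hypothetical, and the listing does not describe the evaluation pipeline, text normalization, or test sets behind the reported numbers, so this illustrates the metric only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored (an assumption)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    # Illustrative strings only; not drawn from any benchmark.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
    print(cer("今天天气很好", "今天天汽很好"))
```

On this definition, a 1.68 percent WER corresponds to roughly one to two word-level errors per hundred reference words. Real evaluations typically also normalize case, punctuation, and numerals before scoring, which this sketch leaves out.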