Fish Speech

Multilingual TTS with zero-shot voice cloning and streaming support.

Open Source · Self Hosted · Offline Capable · GPU Required (4GB+ VRAM)


About

Fish Speech is the open-source text-to-speech project from Fish Audio, built around zero-shot voice cloning: from a 10- to 30-second reference sample it reproduces a speaker's timbre, speaking style, and emotional delivery without any fine-tuning. The repository now hosts the newer S-series generation of the model, which covers more than 80 languages, supports inline delivery tags such as whispering, laughing, or a professional broadcast tone, and can place multiple speakers in a single generated output. The current architecture pairs a large, slow autoregressive decoder for semantics with a small, fast one for acoustic detail, and the model is refined with reinforcement-learning post-training. On server GPUs the project reports a real-time factor of around 0.2, with roughly 100 ms to first audio. Inference runs through a web UI, command line, or API server, with Docker images available and a GPU recommended. The code is openly available, while the model weights carry Fish Audio's own research license, so commercial deployments need to review its terms.
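To make the reported real-time factor concrete, here is a minimal arithmetic sketch. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced); the function names are hypothetical and are not part of Fish Speech, and the 0.2 figure is the project's reported number for server GPUs, not a measurement made here.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor under the assumed definition: compute time / audio duration.

    Values below 1.0 mean audio is generated faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.2) -> float:
    """Estimate compute time for a clip, given a real-time factor.

    The default of 0.2 is the figure the project reports for server GPUs;
    actual values depend on hardware and settings.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * rtf


if __name__ == "__main__":
    clip = 10.0  # seconds of speech to generate (illustrative value)
    estimate = estimated_synthesis_seconds(clip)
    print(f"Estimated synthesis time for {clip:.0f} s of audio: {estimate:.1f} s")
```

Under these assumptions, a real-time factor of 0.2 means a 10-second clip takes about 2 seconds to synthesize. With streaming, playback can begin once the first audio arrives rather than after the whole clip is finished.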