Running F5-TTS Locally for Voice Cloning, A Setup Guide

Billy C

Open source voice cloning has matured to the point where you can stand up a credible local pipeline in an afternoon. F5-TTS is one of the most interesting recent entries. This post is a practical setup guide based on its README, with honest comparisons to two earlier options: the Coqui TTS toolkit and its XTTS-v2 model.

For wider context on building with open source generative tools, see open source AI dev tools you should know.

What F5-TTS actually is

The F5-TTS architecture is described in its README as a Diffusion Transformer paired with ConvNeXt V2. The same repo also packages an implementation of E2 TTS, a Flat-UNet Transformer that the README calls the closest reproduction of the E2 paper. Both are flow matching speech models, which is a different family from the autoregressive transformer TTS models like XTTS.

In practice, F5-TTS clones a voice from a short reference audio clip plus its transcript, then synthesizes that voice reading whatever text you supply. The README also mentions Sway Sampling, an inference-time strategy for choosing flow steps that the project credits with significantly improving output quality.
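To make the idea of an inference-time flow step schedule concrete, here is a minimal sketch. The README does not spell out the formula. The warp below, t = u + s * (cos(pi/2 * u) - 1 + u), and the coefficient s = -1 are assumptions based on my reading of the F5-TTS paper and code, and the function names are mine. Treat it as an illustration, not the project's implementation.

```python
import math


def sway_sample(u, s=-1.0):
    """Warp a uniform flow step u in [0, 1] into a sway-sampled step t.

    Assumed form: t = u + s * (cos(pi/2 * u) - 1 + u).
    With s = 0 the schedule is unchanged; a negative s shifts steps
    toward the start of the flow.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def flow_steps(num_steps, s=-1.0):
    """Evenly spaced steps from 0 to 1, each passed through sway_sample."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


if __name__ == "__main__":
    for u, t in zip(flow_steps(8, s=0.0), flow_steps(8)):
        print(f"uniform {u:.3f} -> warped {t:.3f}")
```

With a negative coefficient the endpoints stay at 0 and 1 while the early steps sit closer together, so more of the step budget goes to the start of the flow.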

Installation

The README recommends a fresh conda environment on Python 3.10 or newer. The minimum sequence is a conda env, ffmpeg, the right PyTorch build for your GPU, and then either the published package or the editable repo install.

conda create -n f5-tts python=3.11
conda activate f5-tts
conda install ffmpeg

# Install PyTorch matching your CUDA / ROCm / MPS setup, then:
pip install f5-tts

If you plan to modify the code or run examples that ship in the repo, clone it instead and run pip install -e . from the project root. Apple Silicon, AMD, and Intel users follow the appropriate PyTorch install guidance; the F5-TTS package itself does not pin to NVIDIA.
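Before moving on, it can help to confirm that the pieces from the install steps are visible from the environment you just activated. The script below is a small sketch using only the standard library, with PyTorch imported lazily if it is present. The function name and report keys are mine, and I am assuming `f5_tts` is the package's import name.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether ffmpeg, PyTorch, and the F5-TTS package are visible."""
    report = {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Assumption: the pip package installs a module named f5_tts.
        "f5_tts_installed": importlib.util.find_spec("f5_tts") is not None,
        "device": "unknown",  # stays "unknown" when PyTorch is missing
    }
    if report["torch_installed"]:
        import torch  # lazy import so the script still runs without PyTorch

        mps = getattr(torch.backends, "mps", None)
        if torch.cuda.is_available():  # ROCm builds also report through this call
            report["device"] = "cuda"
        elif mps is not None and mps.is_available():
            report["device"] = "mps"
        else:
            report["device"] = "cpu"
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `device` comes back as `cpu` on a machine with a GPU, the usual culprit is a PyTorch build that does not match your driver stack.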

Reference audio prep

This is the part that decides your output quality. The README is direct about what F5-TTS expects: a reference audio file with a corresponding text transcription. If you leave the transcription empty, F5-TTS will run an ASR model to transcribe it, at the cost of extra GPU memory.

The README and community wiki both highlight a few practical guidelines. Use a clean clip with minimal background noise. Use a single speaker. Keep the clip long enough to capture vocal characteristics but short enough that the model is not pulled in different emotional directions; in my testing, three to ten seconds worked well. The transcription should be exact, with normal punctuation, because the model uses it to align acoustic features to text content during inference.

For a one-off voice clone, a good workflow is to record yourself reading a prepared paragraph at a normal pace in a quiet room with a decent USB mic. Save it as a 16 or 24 kHz wav file. The transcript is the paragraph you read, exactly.
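A quick sanity check on the recorded clip catches most prep mistakes before you spend GPU time. This sketch reads a PCM wav file with Python's standard `wave` module and flags anything that strays from the guidelines above. The thresholds (mono, at least 16 kHz, three to ten seconds) are this post's rules of thumb, not limits enforced by F5-TTS, and the function names are hypothetical.

```python
import wave


def inspect_wav(path):
    """Return channel count, sample rate, and duration of a PCM wav file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / float(sample_rate)
    return {"channels": channels, "sample_rate": sample_rate, "duration_s": duration_s}


def reference_warnings(info, min_s=3.0, max_s=10.0):
    """List soft warnings for a reference clip.

    min_s and max_s are assumptions taken from this post, not F5-TTS limits.
    """
    warnings = []
    if info["channels"] != 1:
        warnings.append("clip is not mono; consider downmixing to one channel")
    if info["sample_rate"] < 16000:
        warnings.append("sample rate is below 16 kHz; re-record or re-export")
    if not min_s <= info["duration_s"] <= max_s:
        warnings.append(f"duration is outside {min_s:g} to {max_s:g} seconds")
    return warnings


if __name__ == "__main__":
    import sys

    details = inspect_wav(sys.argv[1])
    print(details)
    for message in reference_warnings(details):
        print("warning:", message)
```

Run it as `python check_ref.py ref.wav` (the script name is arbitrary). An empty warning list does not guarantee a good clone, since background noise and an inexact transcript are not checked here.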

Running inference

F5-TTS exposes both a Gradio interface and a CLI. The Gradio interface is the friendlier path:

f5-tts_infer-gradio --port 7860 --host 0.0.0.0

That gives you a browser UI where you upload reference audio and transcript, paste your generation text, and listen to the result. The README also documents a multi-style and multi-speaker mode in the Gradio app that is convenient for short dialogues.

For automation, the CLI is what you want:

f5-tts_infer-cli --model F5TTS_v1_Base \
  --ref_audio "ref.wav" \
  --ref_text "Transcription of the reference audio" \
  --gen_text "The text you want spoken in the cloned voice"

That command produces a wav file in the output directory. From there it slots into any pipeline you have for processing speech.
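If you are driving the CLI from a script, building the argument list in Python avoids shell quoting problems with punctuation in the transcript. The sketch below uses only the flags shown in the command above; the helper names and the `dry_run` switch are mine, and it assumes `f5-tts_infer-cli` is on your PATH.

```python
import shlex
import subprocess


def build_infer_command(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Build the f5-tts_infer-cli argument list using the flags from this post."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


def synthesize(ref_audio, ref_text, gen_text, dry_run=True):
    """Print the command when dry_run is True, otherwise execute it."""
    command = build_infer_command(ref_audio, ref_text, gen_text)
    if dry_run:
        print(shlex.join(command))
    else:
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    lines = ["First sentence to speak.", "Second sentence, with a comma."]
    for line in lines:
        synthesize("ref.wav", "Transcription of the reference audio", line)
```

Passing a list to `subprocess.run` sidesteps the shell entirely, so quotes and apostrophes in the text reach the CLI unchanged. Each call starts a new process and reloads the model, which is fine for a handful of lines but slow for large batches.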

Output quality and setup time

In my testing, a clean three to ten second reference was enough for convincing output, and that is the headline result. Prosody, pacing, and the speaker's vocal timbre all carry over well. F5-TTS still occasionally stumbles on aggressive emphasis and on specialized vocabulary the model has not seen often, especially proper nouns. That limitation is common across the category, not unique to F5-TTS.

End-to-end setup, from a fresh machine to first cloned output, takes roughly an hour if your CUDA toolchain is already in place, and longer if you are setting up Python and drivers from scratch. There is no model training step in your loop; you are running inference on the published F5TTS_v1_Base checkpoint.

How it compares to Coqui TTS and XTTS-v2

Coqui TTS is the broader toolkit. The Coqui repo shipped releases into late 2023 and has a battle-tested production reputation, though Coqui the company has since shut down, so check how actively the repo or a community fork is maintained before you depend on it. It is a generalist library that contains many model families: Tacotron, Glow-TTS, FastSpeech, VITS, HiFi-GAN, and the speaker encoders used for cloning. If your needs are varied, for example mixing multilingual cloning with custom-trained single-speaker models in one pipeline, Coqui is the toolkit that scales to that. The repo also lists more than 1,100 Fairseq-based TTS models that it can load.

XTTS-v2, released by the Coqui team and hosted on Hugging Face, is the specific cloning model most teams compared against until recently. It supports 17 languages including English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi. The model card highlights cloning from as little as a six second clip and emphasizes cross-language voice cloning, where you take an English reference and synthesize Spanish in the same voice. The license is the Coqui Public Model License, which is non-commercial by default and worth reading carefully if you plan a commercial deployment.
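For comparison, here is roughly what cross-language cloning with XTTS-v2 looks like through the Coqui TTS Python API. This is a sketch under assumptions: the language codes and model identifier follow the XTTS-v2 model card as I understand it, the helper names and default output path are hypothetical, and the synthesis function needs the Coqui TTS package installed and downloads the model on first use.

```python
# Language names from this post mapped to the codes XTTS-v2 expects (assumed
# from the model card; verify against the version you install).
XTTS_V2_LANGUAGES = {
    "English": "en", "Spanish": "es", "French": "fr", "German": "de",
    "Italian": "it", "Portuguese": "pt", "Polish": "pl", "Turkish": "tr",
    "Russian": "ru", "Dutch": "nl", "Czech": "cs", "Arabic": "ar",
    "Chinese": "zh-cn", "Japanese": "ja", "Hungarian": "hu", "Korean": "ko",
    "Hindi": "hi",
}


def xtts_language_code(name):
    """Look up a language code, raising ValueError for unsupported languages."""
    try:
        return XTTS_V2_LANGUAGES[name]
    except KeyError:
        raise ValueError(f"XTTS-v2 does not list {name!r} as supported") from None


def clone_across_languages(text, speaker_wav, language, out_path="xtts_output.wav"):
    """Speak `text` in `language` using the voice from `speaker_wav`."""
    from TTS.api import TTS  # lazy import: requires the Coqui TTS package

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=xtts_language_code(language),
        file_path=out_path,
    )
    return out_path


if __name__ == "__main__":
    print(xtts_language_code("Spanish"))
```

An English reference clip with Spanish output would be `clone_across_languages("Hola, ¿qué tal?", "ref.wav", "Spanish")`. Unlike the F5-TTS CLI call, no reference transcript is passed here.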

The honest tradeoff is this. F5-TTS is the newer architecture and tends to produce cleaner, more natural prosody in English. XTTS-v2 has broader language coverage out of the box and a more established ecosystem for cross-language scenarios, subject to the license caveat above. Coqui TTS as a toolkit is the right pick when you want a Swiss Army knife and are willing to wire models together yourself. If you are doing English-first voice cloning today and starting fresh, F5-TTS is what I would set up first.

The repo for F5-TTS is at https://github.com/SWivid/F5-TTS and the README is updated frequently with new checkpoints and tooling.

Tools mentioned in this post

  • F5-TTS: flow matching TTS with Diffusion Transformer and ConvNeXt V2, supporting voice cloning from a short reference clip plus transcription.
  • Coqui TTS: general purpose TTS toolkit with Tacotron, Glow-TTS, VITS, HiFi-GAN, and multi-speaker encoders.
  • XTTS-v2: multilingual voice cloning model from Coqui supporting 17 languages and cross-language cloning from short reference audio.
