ESPnet

End-to-end speech processing toolkit covering ASR, TTS, and speech translation.

Open Source, Self Hosted, Offline Capable, GPU Required (8GB+ VRAM)

About

ESPnet is an end-to-end speech processing toolkit built on PyTorch. It covers speech recognition, text-to-speech, speech translation, speech enhancement and separation, speaker diarization, spoken language understanding, voice conversion, and singing voice synthesis. Developed by an academic community that includes Johns Hopkins and Carnegie Mellon, it follows Kaldi-style data conventions and ships reproducible recipes for many corpora and languages. Supported architectures include Conformer, Branchformer, Transformer, and transducer models for ASR, with streaming support, plus Tacotron 2, FastSpeech, and VITS for TTS. Training scales across nodes via Slurm or MPI, features are extracted on the fly, and experiment tracking integrates with Weights & Biases; pretrained models from fairseq and S3PRL can be plugged in as front ends. Released under the Apache 2.0 license, ESPnet is among the most widely used research toolkits in the speech community and is also used by engineers building production systems.
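
To make "Kaldi-style data conventions" concrete, here is a minimal sketch of the data directory such recipes typically expect: plain-text files named wav.scp, text, utt2spk, and spk2utt, each keyed by ID and sorted. The helper name write_kaldi_data_dir, the utterance IDs, and the audio paths are hypothetical and only illustrate the layout; this is not code taken from ESPnet itself, and real recipes generate these files with their own data preparation scripts.

```python
from collections import defaultdict
from pathlib import Path


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text, utt2spk and spk2utt for a list of utterances.

    Each utterance is a tuple: (utt_id, speaker_id, wav_path, transcript).
    This helper is hypothetical; it only illustrates the file layout.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Kaldi-style tools expect each file to be sorted by its first column.
    utts = sorted(utterances, key=lambda u: u[0])

    def dump(name, lines):
        (out / name).write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")

    dump("wav.scp", [f"{utt} {wav}" for utt, _, wav, _ in utts])   # utt_id -> audio path
    dump("text", [f"{utt} {txt}" for utt, _, _, txt in utts])      # utt_id -> transcript
    dump("utt2spk", [f"{utt} {spk}" for utt, spk, _, _ in utts])   # utt_id -> speaker

    # spk2utt is the inverse mapping: one line per speaker listing its utterances.
    spk2utt = defaultdict(list)
    for utt, spk, _, _ in utts:
        spk2utt[spk].append(utt)
    dump("spk2utt", [f"{spk} {' '.join(ids)}" for spk, ids in sorted(spk2utt.items())])
    return out


if __name__ == "__main__":
    # Hypothetical utterances; IDs are prefixed with the speaker ID so that
    # sorting by utterance also groups by speaker.
    demo = [
        ("spk2-utt1", "spk2", "/data/audio/c.wav", "GOOD MORNING"),
        ("spk1-utt1", "spk1", "/data/audio/a.wav", "HELLO WORLD"),
    ]
    print(sorted(p.name for p in write_kaldi_data_dir(demo, "data/train_demo").iterdir()))
```

Prefixing each utterance ID with its speaker ID is a common habit in this layout because it keeps the per-utterance and per-speaker files in a consistent sort order.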

Details

Category: Speech-to-Text / Speech Recognition
Price: Free
Platform: Local/Desktop
Difficulty: Expert (5/5)
License: Apache-2.0
Minimum VRAM: 8 GB
Added: Apr 3, 2026

Related Tools

Conformer (ESPnet)

Insanely Fast Whisper

Kaldi

Wav2Vec 2.0

Pyannote Audio

Canary (NVIDIA NeMo)