
Speech-to-Text / Speech Recognition AI Tools

Open-source automatic speech recognition (ASR) models and tools for transcribing and translating audio.

Automatic speech recognition is the plumbing behind meeting summarizers, subtitle pipelines, call analytics, voice agents, and the corpora used to train other models. Teams adopting these projects typically optimize three numbers: word error rate on their own audio, latency to first token, and cost per audio hour. By 2026 the open field has settled into three camps. The largest is the Whisper family, where Whisper, Whisper.cpp, Distil-Whisper, and Whisper JAX repackage or distill one strong multilingual checkpoint for different runtimes and inherit its batch-oriented 30-second window. Against that sit the streaming and edge projects, Moonshine, Sherpa-ONNX STT, Vosk, and WeNet, which trade accuracy on noisy or accented speech for a small footprint and immediate partial results. The third camp, ESPnet, Kaldi, SpeechBrain, and NVIDIA NeMo ASR, offers training toolkits rather than ready checkpoints and is worth the setup only when a domain needs fine-tuning. Whisper.cpp is the sane first stop for offline batch work, since it runs quantized weights on CPU with no Python stack, and Distil-Whisper slots in when throughput matters more than the last point of accuracy. Live audio calls for Moonshine or RealtimeSTT, with Pyannote Audio as the usual companion for speaker diarization, which the ASR models do not handle. The trap to check early is that model weights are licensed separately from repository code: NVIDIA's Canary and Parakeet checkpoints carry Creative Commons terms, noncommercial in some cases, while the NeMo code around them is Apache-licensed. Maintenance status deserves the same scrutiny: DeepSpeech still runs, but Mozilla has archived it.
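Word error rate, the first of the three numbers above, can be measured on your own audio before committing to a model. The following is a minimal sketch rather than a reference implementation: it assumes transcripts are plain strings, normalizes only by lowercasing and splitting on whitespace, and the function name `word_error_rate` is illustrative. Production evaluations usually apply heavier text normalization (punctuation, numerals), which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercasing + whitespace split is enough normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the cat sat on the mat"
    model_output = "the cat sat on a mat"
    print(f"WER: {word_error_rate(truth, model_output):.3f}")
```

Because insertions count as errors, the result can exceed 1.0 when a model emits far more words than the reference contains, so it is best read as an error ratio and not as a percentage capped at 100.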

Whisper

Whisper.cpp

ESPnet

Insanely Fast Whisper

Kaldi

Wav2Vec 2.0

Pyannote Audio

Whisper Timestamped

Sherpa-ONNX STT

NVIDIA NeMo ASR

Paraformer (FunASR)

Silero Models

SpeechBrain

Vosk

Whisper JAX

DeepSpeech

Distil-Whisper

FunASR (Paraformer)

FireRedASR

Moonshine

NVIDIA Parakeet

RealtimeSTT

Reverb

SenseVoice

Silero VAD

WeNet

Voxtral

Canary (NVIDIA NeMo)

WhisperKit

Conformer (ESPnet)
