Moshi

Speech-text foundation model for full-duplex, real-time spoken dialogue, built on a neural audio codec.

Open Source · Self Hosted · Offline Capable · GPU Required (24GB+ VRAM)

About

Moshi, from Kyutai Labs, is a speech-text foundation model built for full-duplex spoken dialogue. It models the user's audio and its own audio as separate parallel streams, so it can listen and speak at the same time instead of taking turns. Architecturally, it pairs a 7B-parameter Temporal Transformer with a smaller Depth Transformer that handles inter-codebook structure, giving a theoretical latency of 160 ms and about 200 ms in practice on an L4 GPU. Audio flows through Mimi, the project's streaming neural codec, which encodes 24 kHz audio into 12.5 Hz frames at 1.1 kbps with 80 ms of latency; that frame rate is close to text token rates, which keeps the number of autoregressive steps manageable. The repository ships three backends: PyTorch for research, MLX for Apple silicon, and Rust with Candle for production serving. Code is released under the MIT (Python) and Apache (Rust) licenses, and model weights under CC-BY 4.0, so the project serves as an open reference implementation for real-time voice AI.
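The codec and model figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, not code from the Moshi repository: the function names are hypothetical, and the 2 bytes per parameter (16-bit weights) is an assumption that this listing does not state.

```python
# Back-of-the-envelope checks on the figures quoted in the description.
# All helper names are illustrative; none come from the Moshi codebase.

SAMPLE_RATE_HZ = 24_000   # Mimi input sample rate (from the description)
FRAME_RATE_HZ = 12.5      # Mimi frame rate (from the description)
BITRATE_BPS = 1_100       # Mimi bitrate, 1.1 kbps (from the description)
N_PARAMS = 7e9            # Temporal Transformer size (from the description)


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def samples_per_frame(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of raw audio samples summarized by one codec frame."""
    return sample_rate_hz / frame_rate_hz


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits available to encode one codec frame at the given bitrate."""
    return bitrate_bps / frame_rate_hz


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    ASSUMPTION: 2 bytes per parameter (16-bit weights). Activations,
    caches, and the codec itself are not counted.
    """
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))                  # 80.0 (ms)
    print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_RATE_HZ))  # 1920.0
    print(bits_per_frame(BITRATE_BPS, FRAME_RATE_HZ))        # 88.0
    print(weight_memory_gb(N_PARAMS))                        # 14.0 (GB)
```

One frame at 12.5 Hz spans 80 ms, which matches the codec latency quoted above, and each frame stands in for 1,920 raw samples, which is why the autoregressive step count stays close to text token rates. Under the 16-bit assumption the 7B weights alone take roughly 14 GB, so the listed 24 GB minimum leaves headroom for everything the estimate omits.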

Details

Category: Text-to-Speech (TTS)
Price: Free
Platform: Local/Desktop
Difficulty: Advanced (4/5)
License: MIT (Python code), Apache-2.0 (Rust code), CC-BY 4.0 (model weights)
Minimum VRAM: 24 GB
Added: May 7, 2026

Mentioned in

Beyond Whisper: Parakeet, SenseVoice and ASR in 2026

Whisper is no longer the default: how Parakeet, SenseVoice, Kimi-Audio, Ultravox and Moshi compare on...

Max P

Related Tools

Kokoro TTS

ChatTTS

CosyVoice

CosyVoice 2

EmotiVoice

Bark