AudioLDM 2

Latent diffusion model for text-to-audio, music, and speech generation.

Open Source · Self Hosted · Offline Capable · GPU Required (8GB+ VRAM)

About

AudioLDM 2 generates audio from text with a single latent diffusion framework that handles text-to-audio, text-to-music, and text-to-speech through a shared representation of sound. It was developed by Haohe Liu and colleagues at CVSSP, University of Surrey, and published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. The project ships seven checkpoints, including a general-purpose full model, a 48kHz high-fidelity variant, a music-focused model, and speech models trained on GigaSpeech and LJSpeech. Inference runs on CUDA, CPU, or Apple MPS devices, and the Hugging Face Diffusers integration makes generation roughly three times faster while supporting output of arbitrary length. Users can work through a command-line interface, a Gradio web app, or the Diffusers API, and can tune the guidance scale, the number of sampling steps, and the random seed. The code and models are open source. The typical audience is audio ML researchers, sound designers, and developers prototyping generative audio features.
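As a rough illustration of the Diffusers route described above, the sketch below wires the three tunable settings (guidance scale, sampling steps, seed) into a call to `AudioLDM2Pipeline`. The helper names (`GenerationSettings`, `pick_device`, `pipeline_kwargs`, `generate`), the `cvssp/audioldm2` repository id, and the numeric values are assumptions chosen for illustration, not recommendations from the project; it also assumes `torch` and `diffusers` are installed.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    """Hypothetical container for the knobs mentioned in the description."""
    prompt: str
    guidance_scale: float = 3.5      # illustrative value, not a tuned setting
    num_inference_steps: int = 200   # illustrative value; fewer steps run faster
    seed: int = 0
    audio_length_in_s: float = 10.0


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def pipeline_kwargs(settings: GenerationSettings) -> dict:
    """Map the settings onto keyword arguments for the pipeline call.

    The seed is deliberately left out: it is passed via a torch.Generator.
    """
    return {
        "guidance_scale": settings.guidance_scale,
        "num_inference_steps": settings.num_inference_steps,
        "audio_length_in_s": settings.audio_length_in_s,
    }


def generate(settings: GenerationSettings, repo_id: str = "cvssp/audioldm2"):
    """Load the pipeline and return one waveform as a NumPy array.

    Heavy imports live here so the helpers above work without torch installed.
    The repo id is an assumption; substitute whichever checkpoint you use.
    """
    import torch
    from diffusers import AudioLDM2Pipeline

    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
    # Half precision only on CUDA; CPU and MPS stay in float32 here.
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A CPU generator keeps the seed handling the same on every device.
    generator = torch.Generator(device="cpu").manual_seed(settings.seed)

    result = pipe(settings.prompt, generator=generator,
                  **pipeline_kwargs(settings))
    return result.audios[0]
```

Calling `generate(GenerationSettings(prompt="rain on a tin roof", seed=42))` would download the checkpoint on first use and return a waveform that can be written to disk, for example with `scipy.io.wavfile.write`. The correct sample rate depends on the checkpoint, so check it for the model you load rather than assuming one value.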