HunyuanDiT

Bilingual text-to-image diffusion transformer by Tencent with Chinese and English support.

Open Source · Self Hosted · Offline Capable · GPU Required (12GB+ VRAM)

About

Hunyuan-DiT is Tencent's open text-to-image diffusion transformer, notable for its fine-grained understanding of both Chinese and English prompts. It runs diffusion in latent space with a transformer backbone and a pretrained VAE. Text is encoded with a combination of a bilingual CLIP encoder and a multilingual T5 encoder, and training captions are refined by a dedicated multimodal language model for better semantic alignment. The main v1.2 release has 1.5 billion parameters. It is accompanied by distilled checkpoints with roughly 50 percent faster inference and by a smaller 0.7B HunyuanDiT-S variant. Multi-turn conversational generation lets users refine an image iteratively through dialogue. LoRA fine-tuning and ControlNet adapters are supported, and the model can be used through Hugging Face Diffusers, ComfyUI, and a Gradio interface. Generation needs at least about 11 GB of GPU VRAM. The weights are distributed under the Tencent Hunyuan Community License, and community checkpoints extend the model to domains such as anime and inpainting.
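As a rough illustration of the Diffusers route mentioned above, the sketch below loads the model through the `HunyuanDiTPipeline` class in Hugging Face Diffusers and generates one image. The two Hub repository ids and the step counts (50 for the base checkpoint, 25 for the distilled one) are assumptions for illustration and should be checked against the official model cards. The helper names `pick_checkpoint` and `generate` are hypothetical. A CUDA GPU with enough VRAM is assumed.

```python
from typing import Tuple

# Assumed Hugging Face Hub repository ids; verify against the model cards.
BASE_REPO = "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers"
DISTILLED_REPO = "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers-Distilled"


def pick_checkpoint(distilled: bool = False) -> Tuple[str, int]:
    """Return (repo id, number of sampling steps) for the chosen checkpoint.

    The step counts are illustrative defaults, not official recommendations.
    """
    if distilled:
        return DISTILLED_REPO, 25
    return BASE_REPO, 50


def generate(prompt: str, distilled: bool = False, seed: int = 0):
    """Generate a single PIL image from a Chinese or English prompt."""
    # Heavy imports are kept local so the helper above works without a GPU stack.
    import torch
    from diffusers import HunyuanDiTPipeline

    repo, steps = pick_checkpoint(distilled)
    # Half precision keeps the memory footprint down on consumer GPUs.
    pipe = HunyuanDiTPipeline.from_pretrained(repo, torch_dtype=torch.float16)
    pipe.to("cuda")

    # A seeded generator makes the output reproducible for a given prompt.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(prompt=prompt, num_inference_steps=steps, generator=generator)
    return result.images[0]
```

Calling `generate("一个宇航员在骑马").save("astronaut.png")` would write one image to disk. Passing `distilled=True` switches to the distilled checkpoint with fewer sampling steps, which is where the faster inference comes from.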