CogVideoX

Open-source text-to-video model by Zhipu AI/Tsinghua with 2B and 5B variants.

Open Source · Self Hosted · Offline Capable · GPU Required (12GB+ VRAM)


About

CogVideoX is an open text-to-video model family from Zhipu AI and Tsinghua University, descended from the earlier CogVideo research project and related to the QingYing video generation service. The models use a diffusion transformer with a 3D causal VAE and come in several checkpoints: CogVideoX-2B and CogVideoX-5B generate 6-second clips at 720x480, image-to-video variants animate a still frame, and CogVideoX1.5-5B extends output to 5 to 10 seconds at resolutions up to 1360x768. With the Diffusers integration and BF16, FP16, or INT8 precision options, the 2B model can run in about 4 GB of VRAM and the larger models in roughly 5 to 10 GB, which puts video generation within reach of consumer GPUs. CogVideoX-2B is released under Apache 2.0, while the 5B models carry a custom CogVideoX license. The ecosystem includes ComfyUI and xDiT support. Researchers and technically minded creators use it as an openly available alternative to proprietary video generation APIs.
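
As a rough illustration of the Diffusers route mentioned above, the sketch below loads the 2B checkpoint in FP16 and turns on the memory-saving options that make low-VRAM runs plausible. Several details are assumptions not stated on this page: the Hugging Face model ID `THUDM/CogVideoX-2b`, the 8 fps / 49-frame setting for a 6-second clip, the step count and guidance scale, and the helper names `frame_count` and `generate`, which are hypothetical. Actual memory use and speed will vary with hardware and library versions.

```python
def frame_count(seconds: int, fps: int = 8) -> int:
    """Frames for a clip: one initial frame plus `fps` frames per second.

    Assumption: 8 fps, so a 6-second clip is 6 * 8 + 1 = 49 frames.
    """
    return seconds * fps + 1


def generate(prompt: str,
             out_path: str = "output.mp4",
             model_id: str = "THUDM/CogVideoX-2b",  # assumed model ID
             fps: int = 8) -> str:
    """Generate one 6-second clip and write it to `out_path`."""
    # Heavy imports are deferred so this module loads without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # FP16 weights for the 2B checkpoint.
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

    # Trade speed for memory: stream weights from CPU and decode the
    # video latents in slices and tiles rather than all at once.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    frames = pipe(
        prompt=prompt,
        num_frames=frame_count(6, fps),
        num_inference_steps=50,   # assumed, not tuned
        guidance_scale=6.0,       # assumed, not tuned
    ).frames[0]

    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate("A panda playing guitar in a bamboo forest")` on a machine with a CUDA GPU and the `torch` and `diffusers` packages installed should download the weights on first use and write `output.mp4`. Sequential CPU offload is the slowest of the offload modes, so it is worth dropping if the GPU has memory to spare.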