DeepFloyd IF

Pixel-space diffusion model by DeepFloyd/Stability AI with strong text rendering.

Open Source · Self Hosted · Offline Capable · GPU Required (16GB+ VRAM)

About

DeepFloyd IF is a text-to-image model from the DeepFloyd Lab at Stability AI that generates in pixel space rather than latent space, using a cascade of three stages: a base model producing 64x64 images, an upscaler to 256x256, and a final upscaler reaching 1024x1024. Every stage is a UNet conditioned through cross-attention on embeddings from a frozen T5-XXL text encoder, which gives the model unusually strong prompt understanding and legible text rendering inside images. It reports a zero-shot FID score of 6.66 on COCO. Running the full pipeline takes about 24 GB of VRAM; the first two stages alone need 16 GB, and CPU offloading brings the requirement down to roughly 14 GB. The code is available under a modified MIT license, while the model weights carry the separate DeepFloyd IF license, which initially restricted use to research. The main audience is researchers and practitioners exploring high-fidelity generation, super-resolution, and inpainting.
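The cascade and memory figures above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not DeepFloyd IF code: the `Stage` class, the stage labels, and the `plan_for_vram` helper are hypothetical names introduced here, and the thresholds are the approximate figures quoted in the description rather than measured limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str       # illustrative label, not an official model identifier
    out_size: int   # output side length in pixels (images are square)


# The three-stage cascade as described above: 64 -> 256 -> 1024.
CASCADE = (
    Stage("base", 64),
    Stage("upscaler_1", 256),
    Stage("upscaler_2", 1024),
)


def scale_factors(cascade):
    """Return the upscaling factor between each pair of consecutive stages."""
    return [b.out_size // a.out_size for a, b in zip(cascade, cascade[1:])]


def pixel_counts(cascade):
    """Return the number of pixels each stage outputs."""
    return [s.out_size * s.out_size for s in cascade]


def plan_for_vram(vram_gb: float) -> str:
    """Map available VRAM to a run mode, using the rough figures quoted above.

    Assumption: the thresholds (24, 16, 14 GB) are treated as hard cut-offs
    purely for illustration; real requirements depend on settings and hardware.
    """
    if vram_gb >= 24:
        return "full pipeline on GPU"
    if vram_gb >= 16:
        return "first two stages on GPU"
    if vram_gb >= 14:
        return "CPU offloading"
    return "below the quoted minimum"


if __name__ == "__main__":
    print(scale_factors(CASCADE))   # per-stage upscaling factors
    print(pixel_counts(CASCADE))    # pixels produced at each stage
    print(plan_for_vram(16))
```

Each upscaler multiplies the side length by four, so the final image has 256 times as many pixels as the base output. That is one way to see why the later stages account for much of the memory budget.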