Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer by Enze Xie and team introduces Sana, a text-to-image framework that generates high-quality images up to 4096 × 4096 resolution at a remarkably fast speed, deployable on a laptop GPU. The core designs include a deep compression autoencoder, efficient linear DiT, decoder-only small LLM as a text encoder, and efficient training and sampling strategies. Surprisingly, Sana-0.6B competes with modern giant diffusion models like Flux-12B while being 20 times smaller and 100+ times faster in throughput. Sana-0.6B can generate a high-resolution image in less than 1 second on a 16GB laptop GPU, enabling cost-effective content creation.

https://nvlabs.github.io/Sana/

To top