SnapFusion: Text-to-Image Diffusion Model on Mobile Devices Within Two Seconds

SnapFusion presents an approach for running text-to-image diffusion models on mobile devices in under two seconds. Traditional text-to-image models are computationally expensive, requiring high-end GPUs and cloud-based inference at scale, which raises both cost and privacy concerns. SnapFusion addresses this by introducing an efficient UNet architecture and improved step distillation, enabling user-friendly on-device content creation that keeps user data private. Extensive experiments show that the model achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 denoising steps, democratizing content creation by putting powerful text-to-image diffusion models directly in users' hands without compromising quality.
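To give a rough intuition for step distillation, the core idea is that a student denoiser is trained to reproduce in one step what a teacher denoiser produces in several. The sketch below is a deliberately toy illustration with a one-parameter linear "denoiser" (all names and numbers here are invented for illustration and are not the paper's actual training setup, which distills a latent-diffusion UNet):

```python
import numpy as np

# Toy sketch of step distillation: a student learns to match,
# in ONE step, the output of TWO teacher denoising steps.
# Hypothetical simplification; not SnapFusion's real pipeline.

rng = np.random.default_rng(0)

# Teacher: a fixed linear "denoising" map x -> a * x per step.
a = 0.9

def teacher_two_steps(x):
    # Two consecutive teacher denoising steps.
    return a * (a * x)

# Student: one learnable linear step x -> s * x.
s = 1.0   # initial student parameter
lr = 0.05 # learning rate

for _ in range(500):
    x = rng.standard_normal(64)    # batch of noisy samples
    target = teacher_two_steps(x)  # teacher's two-step output
    pred = s * x                   # student's single-step output
    # Gradient of the mean squared distillation loss w.r.t. s.
    grad = np.mean(2 * (pred - target) * x)
    s -= lr * grad

print(round(s, 3))  # converges toward a**2 = 0.81
```

Applied repeatedly, this kind of distillation halves the number of denoising steps each round, which is what makes few-step on-device inference feasible.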

https://snap-research.github.io/SnapFusion/