Diffusion Training from Scratch on a Micro-Budget

In the quest to democratize the training of large-scale text-to-image generative models, we propose a low-cost approach built around a deferred masking strategy that substantially reduces computational cost. By incorporating mixture-of-experts layers and using synthetic images, we train a 1.16 billion parameter sparse transformer for only $1,890, reaching a competitive 12.7 FID in zero-shot generation on the COCO dataset. Our model delivers quality competitive with Stable Diffusion models and the current state-of-the-art approaches at a fraction of their training cost, underscoring how accessible large-scale AI development can be. We plan to share our training pipeline so that others can train similar models on micro-budgets.
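The deferred masking idea can be illustrated with a small sketch: a lightweight patch mixer first processes the full patch sequence, and only afterwards is a large fraction of patches dropped, so the expensive transformer backbone runs on a small subset that has already absorbed context from the rest. The code below is a rough, simplified illustration of that idea, not the paper's implementation; module names, layer sizes, the 75% mask ratio, and computing the loss only on retained patches are assumptions made for the sketch.

```python
# Minimal sketch of deferred patch masking (illustrative placeholders, not the authors' code).
import torch
import torch.nn as nn

class DeferredMaskingModel(nn.Module):
    def __init__(self, dim=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Cheap mixer over the full patch sequence (placeholder: one transformer layer).
        self.patch_mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Expensive backbone, run only on retained patches (placeholder: four layers).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4
        )
        self.head = nn.Linear(dim, dim)  # predict a per-patch target (e.g. noise)

    def forward(self, patches):                       # patches: (B, N, dim)
        B, N, _ = patches.shape
        mixed = self.patch_mixer(patches)             # all N patches interact cheaply first
        keep = max(1, int(N * (1 - self.mask_ratio))) # mask after mixing ("deferred")
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :keep]
        kept = torch.gather(mixed, 1, idx.unsqueeze(-1).expand(-1, -1, mixed.size(-1)))
        out = self.backbone(kept)                     # heavy compute on ~25% of patches
        return self.head(out), idx                    # predictions + indices they cover

# Usage: compute the training loss only on the retained patches.
model = DeferredMaskingModel()
x = torch.randn(2, 256, 768)                          # e.g. a 16x16 grid of latent patches
pred, idx = model(x)
target = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
loss = nn.functional.mse_loss(pred, target)
```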

https://arxiv.org/abs/2407.15811