This paper introduces the Hourglass Diffusion Transformer (HDiT), an image generative model that operates at high resolutions directly in pixel space. HDiT bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. It does not require the techniques typically used for high-resolution training, such as multiscale architectures or latent autoencoders. The authors show that HDiT performs competitively with existing models on ImageNet and sets a new state of the art for diffusion models on FFHQ. They also report that HDiT incurs less than 1% of the computational cost of a standard diffusion transformer of similar size. Overall, the work offers a new approach to high-resolution image synthesis that stays in pixel space.
https://crowsonkb.github.io/hourglass-diffusion-transformers/