The convolution empire strikes back

Recent research from DeepMind challenges the common belief that transformers, specifically Vision Transformers (ViTs), outperform convolutional networks (ConvNets) when pre-trained on web-scale datasets. Previous comparisons were questionable: they typically pitted ViTs, pre-trained with very large compute budgets, against weak ConvNet baselines. In this study, the authors pre-train Normalizer-Free ResNets (NFNets), a purely convolutional architecture, on the large JFT-4B dataset and show that, after fine-tuning on ImageNet with Sharpness-Aware Minimization (SAM), NFNets match the reported performance of ViTs at comparable compute budgets. Interestingly, the pre-trained checkpoints with the lowest validation loss don't always yield the best accuracy after fine-tuning. The overall takeaway: compute and data remain the most important factors, but architectures still carry their own inductive biases, and ViTs may have practical advantages in specific contexts.
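Since the summary leans on SAM for the fine-tuning step, here is a minimal sketch of SAM's two-pass update in PyTorch. Everything in it (the `sam_step` helper, the `rho` default) is illustrative rather than taken from the paper or the post; it just shows the core idea of descending with the gradient taken at an adversarially perturbed point.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step: perturb the weights
    toward the locally sharpest direction, then apply the base optimizer
    using the gradient computed at the perturbed point."""
    # First pass: gradient of the loss at the current weights w.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Perturb each parameter by e = rho * g / ||g|| (the ascent step).
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)  # w -> w + e
            perturbations.append((p, e))
    model.zero_grad()

    # Second pass: gradient at the perturbed weights w + e.
    loss_fn(model(inputs), targets).backward()

    # Restore the original weights, then step with the
    # sharpness-aware gradient left in p.grad.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)  # back to w
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Each SAM step costs two forward/backward passes instead of one, which is the usual price paid for the flatter minima it tends to find.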

https://gonzoml.substack.com/p/the-convolution-empire-strikes-back
