Vision Transformers Need Registers

In this paper, the authors present a solution to artifacts in the feature maps of ViT networks: additional learnable tokens ("registers") appended to the input sequence, which absorb the high-norm tokens that otherwise appear in low-informative background areas. Registers improve model performance and feature-map quality, set a new state of the art for self-supervised visual models, and enhance object discovery methods. Surprisingly, adding these tokens yields smoother feature and attention maps, benefiting downstream visual processing. The submission commits to the ICLR Code of Ethics, complies with the guidelines, and maintains author anonymity. The work spans unsupervised, self-supervised, semi-supervised, and supervised representation learning.
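The register mechanism described above can be sketched as follows: extra learnable tokens are concatenated to the patch sequence, participate in attention like any other token, and are discarded before the output is used. This is a minimal PyTorch sketch under assumed dimensions; the class, layer choices, and sizes are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch of register tokens in a ViT-style encoder.
    Names and sizes are hypothetical, not the authors' code."""

    def __init__(self, dim=64, num_registers=4, depth=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable register tokens, shared across the batch.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_embeddings):
        b = patch_embeddings.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        # Sequence = [CLS] + patch tokens + register tokens.
        x = torch.cat([cls, patch_embeddings, reg], dim=1)
        x = self.encoder(x)
        # Discard the register tokens: keep [CLS] + patch tokens only.
        return x[:, : x.shape[1] - self.num_registers, :]

model = ViTWithRegisters()
patches = torch.randn(2, 16, 64)  # (batch, num_patches, dim)
out = model(patches)
print(out.shape)  # torch.Size([2, 17, 64]): CLS + 16 patch tokens
```

The key design point is that registers are inputs only: they give the network dedicated slots for the global computation that would otherwise hijack low-informative patch tokens, and they are thrown away at the end.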

https://openreview.net/forum?id=2dnO3LLiJ1
