Self-Supervised Learning from Images with JEPA

This paper presents the Image-based Joint-Embedding Predictive Architecture (I-JEPA), an approach to self-supervised learning from images that does not rely on hand-crafted data augmentations. I-JEPA learns by predicting the representations of several target blocks within an image from the representation of a single context block, with the goal of producing highly semantic features. The key design choice is the masking strategy: the target blocks are sampled at a sufficiently large, semantic scale, while the context block is kept large and spatially informative. Empirically, when combined with Vision Transformers, I-JEPA scales well and performs strongly on downstream tasks ranging from linear classification to object counting and depth prediction. Notably, a ViT-Huge/14 can be trained on ImageNet in under 72 hours using 16 A100 GPUs, underscoring the method's efficiency.

https://arxiv.org/abs/2301.08243
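
To make the training loop concrete, below is a minimal PyTorch sketch of one I-JEPA-style step: a context encoder embeds the visible context block, an EMA-updated target encoder produces target-block representations from the full image, and a predictor regresses those targets from the context plus positional mask tokens. This is a simplified illustration, not the official implementation: the tiny transformer encoders, fixed block indices, and names such as `training_step` are assumptions for readability.

```python
# Simplified I-JEPA-style training step (illustrative sketch, not the official code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, DIM = 64, 128  # e.g. an 8x8 grid of patch embeddings (toy sizes)

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

context_encoder = make_encoder()                  # trained by gradient descent
target_encoder = copy.deepcopy(context_encoder)   # updated only via EMA, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = make_encoder()                        # stands in for the narrow predictor

mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
pos_embed = nn.Parameter(torch.randn(1, PATCHES, DIM) * 0.02)
opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters())
    + [mask_token, pos_embed],
    lr=1e-4,
)

def training_step(patch_embeddings, context_idx, target_blocks, ema=0.996):
    """patch_embeddings: (B, PATCHES, DIM); context_idx: indices of the context block;
    target_blocks: list of index tensors, one per target block (disjoint from context)."""
    x = patch_embeddings + pos_embed

    # Target representations come from the EMA target encoder over the full image.
    with torch.no_grad():
        target_repr = target_encoder(x)

    # The context encoder only sees the context block.
    ctx = context_encoder(x[:, context_idx])

    loss = 0.0
    for idx in target_blocks:
        # Predictor conditions on context tokens plus positional mask tokens
        # marking which target block to predict.
        queries = (mask_token + pos_embed[:, idx]).expand(x.size(0), -1, -1)
        pred = predictor(torch.cat([ctx, queries], dim=1))[:, -idx.numel():]
        loss = loss + F.mse_loss(pred, target_repr[:, idx])  # loss in representation space
    loss = loss / len(target_blocks)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update of the target encoder from the context encoder.
    with torch.no_grad():
        for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

# Example usage with random patch embeddings and fixed block indices.
patches = torch.randn(2, PATCHES, DIM)
context = torch.arange(0, 32)                           # one large context block
targets = [torch.arange(32, 40), torch.arange(48, 56)]  # two target blocks
print(training_step(patches, context, targets))
```

Because the loss is computed between predicted and target representations rather than pixels, there is no decoder and no reliance on augmentation-based invariances; the EMA target encoder is what prevents the trivial collapsed solution.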