This paper introduces Vim (Vision Mamba), a generic vision backbone that uses bidirectional state space models (SSMs) to represent visual data. The authors argue that self-attention, the standard mechanism in visual representation learning, is not necessary for strong results. They compare Vim with well-established vision transformers such as DeiT on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. Vim achieves higher performance while also improving computation and memory efficiency. For instance, the paper reports that Vim is 2.8 times faster than DeiT and saves 86.8% of GPU memory when performing batch inference to extract features from 1248×1248 images. These results suggest that Vim has the potential to become the next-generation backbone for vision foundation models.
https://arxiv.org/abs/2401.09417
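
To make the idea of a bidirectional state space model more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names `ssm_scan` and `bidirectional_ssm`, the scalar state, the fixed parameters `a`, `b`, `c`, and the choice to sum the two directions are all assumptions made for illustration. It only shows the general pattern of scanning a sequence (for example, a flattened sequence of image patches) once forward and once backward with a linear recurrence, so that every output position sees context from both sides.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Scalar-state linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t  # update the hidden state with the new input
        y[t] = c * h         # read out the current state
    return y


def bidirectional_ssm(x, a=0.5, b=1.0, c=1.0):
    """Scan the sequence in both directions and sum the results.

    Summation is an illustrative assumption, not a claim about how
    Vim merges its forward and backward branches.
    """
    x = np.asarray(x, dtype=float)
    forward = ssm_scan(x, a, b, c)
    # Reverse the input, scan, then reverse the output back into place.
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return forward + backward


if __name__ == "__main__":
    # An impulse in the middle spreads to both neighbours.
    print(bidirectional_ssm([0.0, 1.0, 0.0]))
```

Each scan costs time linear in the sequence length, in contrast to the pairwise comparisons of self-attention, which is the general intuition behind the efficiency claims summarized above.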