PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Authors Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut present PaLI-3, a smaller, more efficient vision-language model (VLM) that outperforms much larger models. In their study, they compare Vision Transformer (ViT) models pretrained with classification objectives against contrastively pretrained models (SigLIP). While the SigLIP-based PaLI slightly underperforms on standard image classification, it excels on multimodal benchmarks, particularly localization and visually-situated text understanding. By scaling the SigLIP image encoder up to 2 billion parameters, they achieve a new state of the art in multilingual cross-modal retrieval. The authors hope that PaLI-3, with only 5 billion parameters, will inspire research on key aspects of complex VLMs and pave the way for larger models.
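
As a rough illustration of the contrastive pretraining the summary refers to, below is a minimal sketch of a SigLIP-style pairwise sigmoid loss in plain NumPy. This is not the authors' code: the function name, the fixed temperature and bias values, and the toy unit-norm embeddings are assumptions made for the example; in SigLIP itself the temperature and bias are learned parameters.

```python
import numpy as np

def siglip_style_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid contrastive loss (SigLIP-style) over a batch of n pairs.

    img_emb, txt_emb: (n, d) L2-normalized image and text embeddings, where
    row i of each matrix comes from the same image-text pair.
    temperature, bias: fixed scalars here; learned parameters in SigLIP itself.
    """
    n = img_emb.shape[0]
    logits = temperature * img_emb @ txt_emb.T + bias  # (n, n) pair scores
    labels = 2.0 * np.eye(n) - 1.0                     # +1 for matching pairs, -1 otherwise
    # Each of the n*n image-text pairs is an independent binary classification:
    # take the negative log-sigmoid of the label-scaled score, sum over pairs,
    # and average over the batch.
    log_sigmoid = -np.log1p(np.exp(-labels * logits))
    return -np.sum(log_sigmoid) / n

# Toy usage with random unit-norm embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
txt = rng.normal(size=(4, 16))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(siglip_style_loss(img, txt))
```

Unlike a softmax-based contrastive loss, each pair is scored independently, with no normalization over the whole batch, which is part of what makes this style of pretraining efficient at scale.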

https://arxiv.org/abs/2310.09199
