In this work, the authors study how to build high-performing Multimodal Large Language Models (MLLMs) through careful ablations of architecture components and pre-training data choices. They find that a mix of image-caption, interleaved image-text, and text-only data is crucial for strong few-shot results. The image encoder, together with image resolution and the number of image tokens, has a substantial impact on performance, while, perhaps surprisingly, the design of the vision-language connector matters comparatively little. Scaling up these recipes, they introduce MM1, a family of multimodal models with up to 30B parameters that achieve strong pre-training metrics and competitive results on a range of established multimodal benchmarks.
https://arxiv.org/abs/2403.09611
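
To make the ablated components concrete, here is a minimal PyTorch sketch (not the authors' implementation) of the encoder → connector → LLM pipeline the summary refers to: a vision encoder produces patch features, a simple connector controls the image-token count and projects into the LLM embedding space, and the resulting image tokens are prepended to the text embeddings. The class names, dimensions, and the average-pooling connector are illustrative assumptions.

```python
# Illustrative sketch only: a toy MLLM forward pass.
# The "connector" here is a hypothetical average-pool + linear projection;
# the paper's finding would correspond to swapping this module having little effect,
# while changing the vision encoder, resolution, or num_image_tokens matters a lot.
import torch
import torch.nn as nn


class AvgPoolConnector(nn.Module):
    """Toy vision-language connector: pool patch tokens to a fixed count, then project."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)  # sets the image-token count
        self.proj = nn.Linear(vision_dim, llm_dim)           # maps into LLM embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)


class ToyMLLM(nn.Module):
    """Minimal wrapper: image tokens are prepended to the text token embeddings."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module,
                 text_embed: nn.Embedding, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a ViT returning patch features
        self.connector = connector
        self.text_embed = text_embed
        self.llm = llm  # any decoder mapping (batch, seq, llm_dim) -> logits

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(pixel_values)           # (B, P, vision_dim)
        image_tokens = self.connector(patch_feats)                 # (B, T_img, llm_dim)
        text_tokens = self.text_embed(input_ids)                   # (B, T_txt, llm_dim)
        sequence = torch.cat([image_tokens, text_tokens], dim=1)   # image tokens first
        return self.llm(sequence)
```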