This study introduces Mini-Gemini, a framework that enhances Vision Language Models (VLMs) along three axes: high-resolution visual tokens, high-quality data, and VLM-guided generation. The aim is to narrow the performance gap between current VLMs and advanced models such as GPT-4 and Gemini. Mini-Gemini supports language models ranging from 2B to 34B parameters and achieves leading results on several zero-shot benchmarks, in some cases surpassing private models. The framework gives current models image understanding, reasoning, and generation capabilities simultaneously. The code and models are publicly available.
https://arxiv.org/abs/2403.18814