This paper introduces LLaVA-o1, a vision-language model that performs autonomous multistage reasoning: instead of relying on chain-of-thought prompting, it independently works through sequential stages of reasoning (summarizing the question, captioning the relevant image content, reasoning, and drawing a conclusion), which yields notably more precise answers and lets it outperform larger and even closed-source models on multimodal reasoning benchmarks. Two ingredients drive this result: the LLaVA-o1-100k dataset, which compiles samples from multiple visual question-answering sources with structured reasoning annotations, and a novel inference-time stage-level beam search that samples several candidates at each stage and keeps the best one, allowing performance to scale with additional test-time compute.
https://arxiv.org/abs/2411.10440
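
To make the stage-level beam search idea concrete, here is a minimal Python sketch. It assumes hypothetical `generate` and `compare` callables standing in for the underlying vision-language model (one samples a completion for the current stage, the other judges which of two candidates is better); these names and the greedy pairwise selection are illustrative assumptions, not the paper's exact implementation. The four stage tags follow the structured-output format described in the paper.

```python
import random
from typing import Callable

# The four reasoning stages used by LLaVA-o1, per the paper's structured format.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]


def stage_level_beam_search(
    generate: Callable[[str, str], str],   # hypothetical: sample one completion for a stage
    compare: Callable[[str, str, str], int],  # hypothetical: return 0 or 1 for the better candidate
    question: str,
    beam_width: int = 2,
) -> str:
    """Sketch of stage-level beam search: at each stage, sample several
    candidates, keep the best, append it to the context, and move on."""
    context = question
    for stage in STAGES:
        # Sample `beam_width` candidate outputs for this stage only.
        candidates = [generate(context, stage) for _ in range(beam_width)]
        # Keep the best candidate via pairwise comparison (tournament style).
        best = candidates[0]
        for challenger in candidates[1:]:
            if compare(context, best, challenger) == 1:
                best = challenger
        # Commit the winning stage output before generating the next stage.
        context += f"\n<{stage}>{best}</{stage}>"
    return context


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a real model.
    mock_generate = lambda ctx, stage: f"candidate for {stage} ({random.randint(0, 9)})"
    mock_compare = lambda ctx, a, b: random.randint(0, 1)
    print(stage_level_beam_search(mock_generate, mock_compare, "What is shown in the image?"))
```

Because selection happens once per stage rather than per token or per whole answer, the search stays cheap while still letting extra inference-time compute (a larger `beam_width`) improve the final answer.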