In this study, we introduce Heima, an efficient reasoning framework that leverages hidden latent space to condense the chain-of-thought processes of Multimodal Large Language Models (MLLMs). By condensing each intermediate reasoning stage into a single thinking token, Heima curbs verbosity and sharply reduces the number of tokens generated during reasoning, improving generation efficiency without sacrificing zero-shot task accuracy. Beyond efficiency, Heima preserves problem-solving capability and keeps the reasoning process interpretable, since the condensed latent representations can be reconstructed into textual rationales. Experimental results validate Heima's ability to reconstruct multimodal reasoning processes and demonstrate its potential for improving overall reasoning efficiency.
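To make the core idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of condensing a reasoning stage into a single learnable latent token: instead of generating a long chain-of-thought span, a single trainable embedding is spliced into the sequence where the verbose reasoning would otherwise appear. The class name `ThinkingTokenWrapper` and all dimensions are illustrative assumptions.

```
import torch
import torch.nn as nn


class ThinkingTokenWrapper(nn.Module):
    """Toy sketch: one learnable latent vector stands in for an
    entire condensed reasoning stage (an assumption-laden analogue
    of Heima's single thinking token, not the paper's code)."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Single latent vector intended to absorb a whole CoT stage.
        self.thinking_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

    def forward(self, prompt_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
        prompt = self.embed(prompt_ids)   # (B, Lp, D)
        answer = self.embed(answer_ids)   # (B, La, D)
        # Splice the single thinking token where the verbose CoT would go,
        # so sequence length grows by 1 instead of by hundreds of tokens.
        think = self.thinking_token.expand(prompt.size(0), -1, -1)
        return torch.cat([prompt, think, answer], dim=1)  # (B, Lp+1+La, D)


if __name__ == "__main__":
    wrapper = ThinkingTokenWrapper(vocab_size=100, d_model=16)
    prompt = torch.randint(0, 100, (2, 5))
    answer = torch.randint(0, 100, (2, 3))
    print(wrapper(prompt, answer).shape)  # torch.Size([2, 9, 16])
```

The efficiency gain in this sketch comes purely from sequence length: the model attends over one latent position rather than a full textual rationale, which is the intuition behind the reported reduction in generated tokens.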
https://arxiv.org/abs/2501.19201