Magma is a foundation model that interprets multimodal inputs and grounds them in its environment to carry out tasks. It combines vision-language understanding with spatial and temporal intelligence, which lets it handle tasks such as UI navigation and robot manipulation. Pretrained on a diverse range of datasets, Magma reportedly outperforms other models on both in-distribution and out-of-distribution tasks. Two techniques set it apart: Set-of-Mark for action grounding and Trace-of-Mark for action planning. Beyond agentic tasks, Magma also answers spatial reasoning questions and performs well on video QA benchmarks.
https://microsoft.github.io/Magma/