OmniParser for Pure Vision Based GUI Agent

OmniParser enhances GPT-4V's ability to generate accurate actions on user interfaces by parsing screenshots into structured elements: it detects interactable icons and describes their semantics. A curated dataset of interactable icons is used to fine-tune the detection and captioning models, yielding improved performance on several GUI-agent benchmarks. Notably, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot. The interactable-region detection model also substantially boosts the performance of other vision-language models such as Phi-3.5-V and Llama-3.2-V, making OmniParser a useful plugin for off-the-shelf VLMs: it supplies local semantics of icon functionality that improve downstream task performance.
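To make the two-stage pipeline concrete, here is a minimal sketch of an OmniParser-style parser, assuming an Ultralytics YOLO detector fine-tuned on interactable icons and a BLIP captioner for icon semantics. The weight path `weights/icon_detect.pt`, the `parse_screenshot` helper, and the prompt format are illustrative assumptions, not the actual OmniParser API.

```python
# Sketch of an OmniParser-style screenshot parser (assumed components, not the
# official implementation): a YOLO detector finds interactable regions, and a
# BLIP captioner describes what each region does.
from PIL import Image
from ultralytics import YOLO
from transformers import BlipProcessor, BlipForConditionalGeneration

detector = YOLO("weights/icon_detect.pt")  # placeholder fine-tuned weights
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def parse_screenshot(path: str) -> list[dict]:
    """Detect interactable regions, then caption each cropped region."""
    image = Image.open(path).convert("RGB")
    detections = detector(image)[0]  # one Results object per input image
    elements = []
    for i, box in enumerate(detections.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = map(int, box)
        crop = image.crop((x1, y1, x2, y2))
        inputs = processor(images=crop, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=20)
        caption = processor.decode(out[0], skip_special_tokens=True)
        elements.append({"id": i, "bbox": (x1, y1, x2, y2), "caption": caption})
    return elements

# The structured element list is serialized into the VLM prompt, so the agent
# can refer to elements by id rather than reasoning over raw pixels.
elements = parse_screenshot("screenshot.png")
print("\n".join(f"[{e['id']}] {e['caption']} at {e['bbox']}" for e in elements))
```

Feeding the VLM a numbered list of captioned regions lets it output actions like "click element 7" instead of coordinates, which is the core idea behind the reported gains.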

https://microsoft.github.io/OmniParser/