ScreenAI: A visual language model for UI and visually-situated language understanding

Introducing ScreenAI, a vision-language model for understanding user interfaces and infographics. It handles screen layout understanding and question-answering tasks by jointly processing visual and textual inputs. ScreenAI is trained on a mixture of datasets, including a new Screen Annotation pretraining task, with a focus on UI understanding and interaction. By building on the PaLI architecture and adopting a flexible patching strategy, ScreenAI achieves state-of-the-art results with only 5B parameters. Three new datasets are also released to further evaluate its capabilities. The authors note that more research is needed to close the gap with substantially larger models.
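As a minimal sketch of what a flexible patching strategy might look like (the function name, patch budget, and parameters below are illustrative assumptions, not taken from the post): the idea is to choose a patch grid whose shape follows the screenshot's aspect ratio rather than forcing a fixed square grid, so tall mobile screens and wide desktop screens are tiled differently within the same patch budget.

```python
import math

def flexible_patch_grid(img_w: int, img_h: int,
                        max_patches: int = 1024) -> tuple[int, int]:
    """Pick a (rows, cols) patch grid that roughly preserves the image's
    aspect ratio while keeping rows * cols within the patch budget.

    Hypothetical sketch of aspect-ratio-aware patching; the real model's
    exact scheme and hyperparameters may differ.
    """
    aspect = img_h / img_w
    # Solve rows * cols <= max_patches subject to rows / cols ~= aspect.
    cols = max(1, math.floor(math.sqrt(max_patches / aspect)))
    rows = max(1, math.floor(cols * aspect))
    return rows, cols

# A portrait phone screenshot gets more rows; a landscape monitor more cols.
print(flexible_patch_grid(1080, 2400))  # -> (46, 21)
print(flexible_patch_grid(2560, 1440))  # -> (23, 42)
```

Both example grids stay under the 1024-patch budget while matching the screen's orientation, which is the point of patching flexibly instead of resizing every screenshot to one fixed shape.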

https://research.google/blog/screenai-a-visual-language-model-for-ui-and-visually-situated-language-understanding/
