Vision language models are blind

Vision language models (VLMs) such as GPT-4o and Gemini-1.5 Pro power many image-text applications, yet they struggle with tasks humans find easy, such as identifying overlapping circles or counting the intersections of lines and letters. Despite strong performance on questions about charts and graphs, the same models fail badly on these basic visual probes, a mix of impressive accuracy and surprising failures that raises concerns about the accuracy and reliability of their vision capabilities.
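To make the kind of probe concrete, here is a minimal sketch (not code from the project) of how such a test image could be generated: two circles whose overlap status is known by construction, so a VLM's yes/no answer can be scored automatically. The `make_circle_pair` helper and the pixel values are illustrative assumptions, and the actual model query is left as a comment since it depends on the provider's API.

```python
# Illustrative only: generate a "do these circles overlap?" probe image
# with a known ground truth, in the spirit of the tasks described above.
from PIL import Image, ImageDraw


def make_circle_pair(distance: int, radius: int = 60, size: int = 400):
    """Draw two circles `distance` pixels apart; return (image, overlaps)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    cy = size // 2
    centers = [(size // 2 - distance // 2, cy), (size // 2 + distance // 2, cy)]
    for cx, ccy in centers:
        draw.ellipse(
            [cx - radius, ccy - radius, cx + radius, ccy + radius],
            outline="black",
            width=4,
        )
    overlaps = distance < 2 * radius  # ground truth by construction
    return img, overlaps


if __name__ == "__main__":
    for d in (80, 160):  # one overlapping pair, one separated pair
        img, overlaps = make_circle_pair(d)
        img.save(f"circles_d{d}.png")
        print(f"distance={d}px -> ground truth overlap: {overlaps}")
        # The saved image would then be sent to a VLM (e.g., GPT-4o or
        # Gemini-1.5 Pro) with a prompt like "Do the two circles overlap?
        # Answer yes or no." and the reply compared against `overlaps`.
```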

https://vlmsareblind.github.io/
