Benchmarking vision-language models on OCR in dynamic video environments

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) in dynamic video environments, built on a curated dataset of 1,477 manually annotated frames spanning multiple domains. Leading VLMs, including Claude-3, Gemini-1.5, and GPT-4o, are compared against traditional OCR systems. The results indicate that VLMs can outperform conventional OCR models, but they remain prone to hallucinations and sensitive to text styles. The dataset and benchmarking framework are publicly available for further research.

https://arxiv.org/abs/2502.06445
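
For context, frame-level OCR benchmarks of this kind typically score each prediction against the human annotation with an edit-distance metric such as character error rate (CER). The sketch below illustrates that idea in general terms; the JSON annotation schema and file names are hypothetical placeholders, not the paper's actual format or evaluation code.

```python
# Minimal sketch of a plausible frame-level OCR scoring loop.
# The annotation schema and file names below are assumptions
# for illustration, not taken from the paper.
import json

def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(pred: str, truth: str) -> float:
    """CER = edit distance normalized by ground-truth length (lower is better)."""
    return levenshtein(pred, truth) / max(len(truth), 1)

def score_predictions(pred_path: str, truth_path: str) -> float:
    """Average CER over all annotated frames."""
    with open(pred_path) as f:
        preds = json.load(f)    # e.g. {"frame_0001.png": "predicted text", ...}
    with open(truth_path) as f:
        truths = json.load(f)   # e.g. {"frame_0001.png": "ground-truth text", ...}
    rates = [char_error_rate(preds.get(frame, ""), text)
             for frame, text in truths.items()]
    return sum(rates) / len(rates)

# Usage: print(score_predictions("vlm_outputs.json", "annotations.json"))
```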
