VideoPrism: A foundational visual encoder for video understanding

Google Research introduces VideoPrism, a versatile video encoder that achieves state-of-the-art results across a wide range of video understanding tasks with a single frozen model. VideoPrism is designed to handle diverse video data and is pre-trained on a massive corpus of 36 million high-quality video-text pairs and 582 million video clips with noisy parallel text. The model combines contrastive learning with masked video modeling to learn from both text descriptions and the visual content of videos, excelling at tasks that require understanding both appearance and motion. VideoPrism also surpasses task-specialized models on scientific applications, showcasing its potential to transform video analysis across fields.
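The contrastive video-text objective mentioned above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE loss, where matched video/text embedding pairs are pulled together and mismatched pairs within a batch are pushed apart. This is a generic sketch of the technique, not VideoPrism's actual implementation; the function name, batch shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def infonce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Illustrative sketch only -- not the actual VideoPrism training code.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(v))          # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss encourages the frozen encoder's video embeddings to align with their paired text embeddings, which is what enables zero-shot retrieval and classification from a single model.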

https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/