Video-LLaVA

Video-LLaVA is a project that learns a united visual representation by aligning it with the language feature space before projection. By binding these unified visual representations to the language feature space, it enables a large language model (LLM) to perform visual reasoning on images and videos simultaneously, and it exhibits strong interactive capability between the two modalities even though the training data contains no image-video pairs. Complementary learning from both video and image modalities yields high performance. The repository provides a demo and code, instructions on requirements and installation, and, lastly, acknowledgements, related projects, license information, and citations.
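
To make the "alignment before projection" idea concrete, here is a minimal PyTorch sketch, not the repository's actual API: the class name, dimensions, and encoder interfaces are assumptions. It assumes image and video encoders whose outputs are already aligned into one visual feature space (LanguageBind-style), so a single shared projector can map both modalities into the LLM's token-embedding space.

```python
from typing import Optional

import torch
import torch.nn as nn


class AlignBeforeProject(nn.Module):
    """Illustrative sketch of alignment-before-projection.

    Assumes both encoders are pre-aligned so images and videos already
    share one visual feature space; a single shared projector then maps
    that space into the LLM's token-embedding space. Names and default
    dimensions are hypothetical, not Video-LLaVA's actual code.
    """

    def __init__(self, image_encoder: nn.Module, video_encoder: nn.Module,
                 visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder  # pre-aligned image tower
        self.video_encoder = video_encoder  # pre-aligned video tower
        # One projector serves both modalities, because alignment has
        # already happened *before* this projection step.
        self.projector = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images: Optional[torch.Tensor] = None,
                videos: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Each encoder returns tokens shaped (batch, n_tokens, visual_dim).
        feats = []
        if images is not None:
            feats.append(self.image_encoder(images))
        if videos is not None:
            feats.append(self.video_encoder(videos))
        visual_tokens = torch.cat(feats, dim=1)  # one united visual sequence
        return self.projector(visual_tokens)     # tokens in LLM embedding space
```

Because the two towers are aligned up front, the LLM receives image and video tokens in the same space and one projector suffices; this is what allows image-video interaction to emerge without paired image-video training data.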

https://github.com/PKU-YuanGroup/Video-LLaVA