Using GPT-4 Vision with Vimium to browse the web

vimGPT is a project that aims to give multimodal models a way to browse the web, exploring whether GPT-4V's vision capabilities are enough to drive a browser. One challenge is determining which element the model wants to interact with without handing it the browser DOM as text. To address this, the project uses Vimium, a Chrome extension that lets users navigate the web with only their keyboard: Vimium overlays letter hints on clickable elements, so the model only needs to look at a screenshot and answer with the hint to press. The author suggests several ideas for improvement, including using the Assistants API for context retrieval, building a specialized version of Vimium just for overlaying element hints, and feeding the model higher-resolution screenshots for better element detection. They also mention LLaVA or CogVLM as potential alternatives for faster and more accurate results, and adding speech-to-text so the tool can be driven by voice, making the project more accessible. The reference below points interested readers to the project itself.
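
The sketch below illustrates the general loop described above, not the project's actual code: take a screenshot of a page with Vimium's link hints shown, ask GPT-4 Vision which hint matches the objective, and type those letters. The model name, prompt, and helper function are assumptions for illustration, and loading the Vimium extension into the browser is omitted.

```python
# Illustrative sketch of a vision-driven browsing step, assuming Vimium is
# installed in the launched browser and an OpenAI key is set in the environment.
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4v_for_hint(screenshot_png: bytes, objective: str) -> str:
    """Send the screenshot to GPT-4 Vision and return the hint letters to press."""
    b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # vision-capable model available at the time
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Objective: {objective}. Reply with only the Vimium "
                         f"hint letters of the element to click."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

with sync_playwright() as p:
    # The real project launches Chromium with the Vimium extension loaded;
    # that setup is left out of this sketch.
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://news.ycombinator.com")
    page.keyboard.press("f")  # Vimium: show link hints over clickable elements
    hint = ask_gpt4v_for_hint(page.screenshot(), "open the top story")
    page.keyboard.type(hint)  # typing the hint letters activates that element
    browser.close()
```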

https://github.com/ishan0102/vimGPT