Llama3-V has taken the spotlight by outperforming GPT-3.5, and even GPT-4, on certain benchmarks. This first-ever multimodal model built on Llama3 comes with the bonus of being trained for under $500. The benchmarks speak for themselves, showing a 10–20% boost over the current SOTA open-source model, LLaVA. What sets Llama3-V apart is its focus on understanding visual information: it uses the SigLIP model to embed input images and aligns those embeddings with the textual tokens. Through training optimizations like embedding caching and MPS/MLX enhancements, Llama3-V offers performance comparable to models 100x larger in size. Explore Llama3-V on 🤗 and GitHub to see this cutting-edge advancement in action.
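To make the SigLIP-plus-alignment idea concrete, here is a minimal PyTorch sketch of the general LLaVA-style recipe: embed an image with a SigLIP vision tower, then project the patch embeddings into the LLM's token-embedding space. The `VisionProjector` module, its two-layer MLP shape, the `google/siglip-so400m-patch14-384` checkpoint, and the 4096-dim Llama3 embedding size are illustrative assumptions, not Llama3-V's published implementation.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

class VisionProjector(nn.Module):
    """Hypothetical projector mapping SigLIP patch embeddings into the
    Llama3 token-embedding space, so they can be prepended to text tokens."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP, a common LLaVA-style choice; the exact
        # architecture here is an assumption, not Llama3-V's spec.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_embeds)

# SigLIP checkpoint; Llama3-V's exact choice may differ.
ckpt = "google/siglip-so400m-patch14-384"
processor = AutoProcessor.from_pretrained(ckpt)
vision_model = SiglipVisionModel.from_pretrained(ckpt)
projector = VisionProjector()

image = Image.open("example.jpg")  # placeholder input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # (1, num_patches, 1152): patch embeddings from the SigLIP vision tower
    patch_embeds = vision_model(pixel_values=pixel_values).last_hidden_state

# (1, num_patches, 4096): ready to concatenate with Llama3 text embeddings
image_tokens = projector(patch_embeds)
print(image_tokens.shape)
```

Since the vision tower is typically frozen in this setup, its patch embeddings can be precomputed once and cached across training epochs, which is the kind of caching optimization that helps keep the training cost low.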
https://aksh-garg.medium.com/llama-3v-building-an-open-source-gpt-4v-competitor-in-under-500-7dd8f1f6c9ee