All-in-one embedding model for interleaved text, images, and screenshots

Voyage AI has introduced voyage-multimodal-3, a multimodal embedding model for RAG and semantic search over documents that combine visuals and text. The model vectorizes interleaved text and images and captures key visual features from screenshots of PDFs, figures, tables, and similar content, which removes the need for complex document parsing. Compared to existing models, voyage-multimodal-3 improves retrieval accuracy by an average of 19.63%. It significantly outperforms models such as OpenAI CLIP large and Cohere multimodal v3, and it performs strongly across a range of tasks, including mixed-modality search over screenshots.
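To make the retrieval step concrete: once queries and document screenshots are embedded into the same vector space, search typically reduces to ranking documents by similarity to the query. The sketch below assumes cosine similarity as the metric and uses hand-made 3-dimensional vectors as stand-ins for real embeddings. The vectors, the document names, and the `cosine` and `rank` helpers are illustrative assumptions, not output or API of voyage-multimodal-3.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in practice each vector would come from
# embedding a screenshot or an interleaved text-and-image document.
doc_vecs = {
    "pdf_page_screenshot": [0.9, 0.1, 0.2],
    "table_screenshot": [0.2, 0.8, 0.1],
    "text_and_figure": [0.1, 0.2, 0.9],
}

# Hypothetical embedding of a text query.
query_vec = [0.85, 0.15, 0.25]

ranking = rank(query_vec, doc_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because every document is represented by a single vector regardless of whether it started as text, an image, or a screenshot, the ranking code does not need modality-specific branches or a parsing step.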

https://blog.voyageai.com/2024/11/12/voyage-multimodal-3/