Chameleon: Meta’s New Multi-Modal LLM

Chameleon is an early-fusion, token-based mixed-modal model that both understands and generates images and text in any interleaved order. The model’s stable training approach, alignment recipe, and tailored architectural parameterization make it stand out in tasks such as visual question answering, image captioning, and text generation. Notably, Chameleon outperforms Llama-2 on text-only tasks and is competitive with larger models such as Mixtral 8x7B and Gemini-Pro across a range of benchmarks. Its image generation is also strong, and in long-form mixed-modal generation evaluations it is preferred over models such as Gemini Pro and GPT-4V. Overall, Chameleon represents significant progress in modeling full multimodal documents.
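To make "early-fusion token-based" concrete: the idea is that images are quantized into discrete tokens (via a learned codebook) and spliced into the same sequence as text tokens, so a single transformer models any interleaving of the two modalities. The toy vector quantizer, vocabulary sizes, and sentinel tokens below are illustrative assumptions, not Chameleon's actual components:

```python
import numpy as np

CODEBOOK_SIZE = 16        # toy image-token vocabulary (illustrative)
TEXT_VOCAB_OFFSET = 1000  # keep text and image token ids in disjoint ranges
BOI, EOI = 998, 999       # hypothetical begin/end-of-image sentinel tokens

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, 4))  # 4-dim toy patch embeddings

def image_to_tokens(patches: np.ndarray) -> list[int]:
    """Quantize each patch vector to the index of its nearest codebook entry."""
    dists = np.linalg.norm(patches[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1).tolist()

def build_sequence(text_ids: list[int], patches: np.ndarray) -> list[int]:
    """Interleave text and image tokens into one flat sequence."""
    image_ids = image_to_tokens(patches)
    return ([TEXT_VOCAB_OFFSET + t for t in text_ids]
            + [BOI] + image_ids + [EOI])

patches = rng.normal(size=(6, 4))          # a 6-patch toy "image"
seq = build_sequence([1, 2, 3], patches)   # text then image, one sequence
print(len(seq))  # 3 text tokens + 2 sentinels + 6 image tokens = 11
```

Because every modality lives in one token stream, the same next-token objective covers text-to-image, image-to-text, and arbitrary mixed orderings, which is what distinguishes early fusion from designs that bolt a separate vision encoder onto a language model.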
