Fuyu-8B: A multimodal architecture for AI agents

We are excited to release Fuyu-8B, a smaller version of our multimodal model. Fuyu-8B has a simpler architecture and training procedure than other multimodal models, making it easier to understand, scale, and deploy. It is designed specifically for digital agents: it supports various image resolutions, answers questions about graphs and diagrams, answers UI-based questions, and performs fine-grained localization on screen images. The model is fast, delivering responses for large images in less than 100 milliseconds. Despite being optimized for our use case, it performs well on standard image-understanding benchmarks. Fuyu-8B is available under an open license, and we can’t wait to see what the community builds with it.

https://www.adept.ai/blog/fuyu-8b