Ferret: A Multimodal Large Language Model

Ferret is an end-to-end multimodal large language model (MLLM) that supports fine-grained, open-vocabulary referring and grounding. It combines a hybrid region representation with a spatial-aware visual sampler. The model is trained on GRIT, a large-scale, hierarchical, and robust ground-and-refer instruction-tuning dataset. The project also introduces Ferret-Bench, a multimodal evaluation benchmark that tests referring and grounding, semantics, knowledge, and reasoning. The code for the Ferret model and Ferret-Bench is publicly available, but it is intended for research use only and should not be used for commercial purposes. Overall, Ferret targets tasks in which a model must refer to, or locate, specific regions of an image.
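
To make the idea of a hybrid region representation more concrete, here is a minimal sketch of pairing discrete coordinates with a continuous feature for one image region. It is not the code from the Ferret repository. The names `HybridRegion`, `quantize_box`, `masked_mean_pool`, and `build_hybrid_region` are hypothetical, the choice of 1000 coordinate bins is an assumption made for illustration, and plain masked average pooling is used as a simple stand-in for the spatial-aware visual sampler, which is more involved.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class HybridRegion:
    """Hypothetical container: discrete box coordinates plus a continuous feature."""
    coords: Tuple[int, int, int, int]
    feature: List[float]


def quantize_box(box: Sequence[float], width: float, height: float,
                 bins: int = 1000) -> Tuple[int, int, int, int]:
    """Map a pixel box (x1, y1, x2, y2) to integer bins in [0, bins - 1].

    The bin count is an illustrative assumption, not a documented Ferret setting.
    """
    def q(value: float, size: float) -> int:
        return min(bins - 1, max(0, int(value / size * bins)))

    x1, y1, x2, y2 = box
    return (q(x1, width), q(y1, height), q(x2, width), q(y2, height))


def masked_mean_pool(feature_map: Sequence[Sequence[Sequence[float]]],
                     mask: Sequence[Sequence[int]]) -> List[float]:
    """Average the feature vectors at positions where the mask is nonzero.

    feature_map is H x W x C and mask is H x W. This is a simplified
    stand-in for a learned region sampler.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(feature_map, mask):
        for vec, selected in zip(row_feats, row_mask):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


def build_hybrid_region(box, mask, feature_map, width, height) -> HybridRegion:
    """Combine the discrete and continuous parts into one region description."""
    return HybridRegion(
        coords=quantize_box(box, width, height),
        feature=masked_mean_pool(feature_map, mask),
    )


if __name__ == "__main__":
    # Toy 2 x 2 feature map with 2 channels; the mask selects the top row.
    toy_features = [[[1.0, 0.0], [3.0, 2.0]],
                    [[5.0, 4.0], [7.0, 6.0]]]
    toy_mask = [[1, 1],
                [0, 0]]
    region = build_hybrid_region((0, 0, 100, 50), toy_mask, toy_features,
                                 width=100, height=100)
    print(region.coords, region.feature)
```

The discrete part can be written into a text prompt as ordinary tokens, while the continuous part carries visual detail that a box alone cannot express, such as the content of a free-form region. For the actual model, see the repository linked below.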

https://github.com/apple/ml-ferret