No “Zero-Shot” Without Exponential Data

The authors examine the questionable "zero-shot" generalization capabilities of multimodal models such as CLIP and Stable Diffusion, which are commonly attributed to their web-crawled pretraining datasets. In an extensive study spanning 34 models and five large pretraining datasets, they find a concerning trend: these models need exponentially more pretraining data to improve downstream "zero-shot" performance linearly. The trend persists even when controlling for similarities between pretraining and test data and when evaluating on synthetic data distributions. The models also perform poorly on long-tailed data, which motivates the authors' "Let it Wag!" benchmark for further research. Overall, the study highlights that achieving "zero-shot" generalization under large-scale training requires correspondingly extensive training data.
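To make the headline finding concrete, here is a minimal sketch (not code from the paper, and with made-up numbers) of the kind of log-linear relationship the study reports: zero-shot accuracy improves roughly linearly only as the frequency of a concept in the pretraining data grows exponentially.

```python
# Sketch only: illustrates the reported "exponential data for linear gains" trend
# with hypothetical concept frequencies and accuracies, not the paper's data.
import numpy as np

# Hypothetical concept frequencies in a web-crawled pretraining set (counts)
# and the corresponding downstream zero-shot accuracies (fractions).
concept_freq = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
zero_shot_acc = np.array([0.22, 0.31, 0.40, 0.49, 0.58])

# Fit accuracy = a * log10(frequency) + b; a roughly constant slope in
# log-space means each fixed accuracy gain costs ~10x more data.
a, b = np.polyfit(np.log10(concept_freq), zero_shot_acc, deg=1)
print(f"accuracy ~= {a:.3f} * log10(freq) + {b:.3f}")

# Extrapolate: how many concept examples would 70% accuracy require
# under this (assumed) log-linear fit?
needed_freq = 10 ** ((0.70 - b) / a)
print(f"~{needed_freq:.0f} concept examples needed to reach 70% accuracy")
```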

https://arxiv.org/abs/2404.04125