In this study, we examine the abstract reasoning abilities of GPT-4, in both its text-only and multimodal versions. We use the ConceptARC benchmark, which is designed specifically to evaluate understanding of and reasoning with core-knowledge concepts. Building on previous work by Moskvichev et al., our evaluation uses more detailed one-shot prompts, rather than simple zero-shot prompts, on text versions of ConceptARC tasks. In addition, we evaluate GPT-4V, the multimodal version of GPT-4, using zero- and one-shot prompts on image versions of the simplest tasks. Our experimental results suggest that neither version of GPT-4 has developed robust abstraction abilities comparable to those of humans.
https://arxiv.org/abs/2311.09247
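
To make concrete what a one-shot text prompt for an ARC-style task might look like, here is a minimal sketch assuming the OpenAI Python SDK and its Chat Completions API. The prompt wording, the grid encoding, the toy demonstration task, and the model name are illustrative assumptions, not the prompts or protocol used in the paper.

```python
# Minimal sketch (not the paper's exact prompts): build a one-shot text prompt
# for an ARC/ConceptARC-style grid task and query GPT-4 via the OpenAI SDK.
# The prompt wording, grid encoding, and model name are illustrative assumptions.
from openai import OpenAI


def grid_to_text(grid: list[list[int]]) -> str:
    """Encode a grid of color indices (0-9) as space-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_one_shot_prompt(demo_input, demo_output, test_input) -> str:
    """One-shot prompt: a single worked demonstration pair, then the test input."""
    return (
        "You will solve a puzzle. Each puzzle transforms an input grid into an output grid.\n\n"
        "Example input:\n" + grid_to_text(demo_input) + "\n"
        "Example output:\n" + grid_to_text(demo_output) + "\n\n"
        "Now apply the same transformation.\n"
        "Test input:\n" + grid_to_text(test_input) + "\n"
        "Test output:"
    )


if __name__ == "__main__":
    # Toy task for illustration only: reflect the grid left-to-right.
    demo_in = [[1, 0, 0], [0, 2, 0]]
    demo_out = [[0, 0, 1], [0, 2, 0]]
    test_in = [[3, 0, 0], [0, 0, 4]]

    prompt = build_one_shot_prompt(demo_in, demo_out, test_in)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; GPT-4V would require an image-capable model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(response.choices[0].message.content)
```

A zero-shot variant would simply omit the demonstration pair; the image-based GPT-4V evaluation described in the abstract would instead attach rendered grid images to the message, which this text-only sketch does not cover.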