In this study, it is argued that large language models (LLMs) claimed to excel across a wide range of functions and tasks show a dramatic breakdown in function and reasoning when presented with a simple common-sense problem easily solvable by humans. The models exhibit strong overconfidence in their incorrect solutions and produce nonsensical explanations to justify their answers. Standard interventions fail to correct the models’ errors, leading the authors to call for an urgent re-assessment of the claimed capabilities of current LLMs and for standardized benchmarks able to detect such reasoning deficits. The authors also release code for reproducing the experiments, along with the raw data.
https://arxiv.org/abs/2406.02061
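
Below is a minimal sketch of how such an evaluation could be run on a problem of the kind described in the linked paper (the "Alice has N brothers and M sisters" family). The prompt template, the `query_model` placeholder, and the answer-extraction step are illustrative assumptions, not the paper's actual harness.

```python
# Minimal sketch: evaluate an LLM on a simple common-sense problem family.
# query_model is a hypothetical stand-in for any LLM API call; swap in a real one.
import re
import random


def make_problem(n_brothers: int, n_sisters: int) -> tuple[str, int]:
    """Build one problem instance and its ground-truth answer.

    Alice's brother shares all of Alice's sisters and also has Alice herself
    as a sister, so the correct answer is n_sisters + 1.
    """
    prompt = (
        f"Alice has {n_brothers} brothers and she also has {n_sisters} sisters. "
        "How many sisters does Alice's brother have? Answer with a single number."
    )
    return prompt, n_sisters + 1


def query_model(prompt: str) -> str:
    """Hypothetical LLM call -- replace with your provider's API."""
    raise NotImplementedError("plug in an actual model call here")


def extract_number(text: str) -> int | None:
    """Pull the last integer from the model's reply, if any."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None


def evaluate(num_trials: int = 20) -> float:
    """Return accuracy over randomly sampled problem instances."""
    correct = 0
    for _ in range(num_trials):
        prompt, answer = make_problem(random.randint(1, 5), random.randint(1, 5))
        reply = query_model(prompt)
        if extract_number(reply) == answer:
            correct += 1
    return correct / num_trials
```

Because the correct answer is always the number of sisters plus one, comparing the extracted integer against that ground truth makes the check trivial; the failure mode the paper reports is the model itself, not the scoring.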