Studies of the capabilities and limitations of Large Language Models (LLMs) paint a mixed picture: the models display genuine problem-solving skills, yet they exhibit reasoning gaps relative to humans, raising concerns about how well they generalize. Because LLMs are trained on massive corpora, traditional ways of measuring generalization, such as train-test set separation, are impractical. Instead, by tracing which pretraining documents most influence a model's outputs (a toy sketch of this idea follows the link below), researchers can examine how these models approach reasoning tasks. Surprisingly, for reasoning problems LLMs tend to draw on procedural knowledge found in influential documents, whereas for factual questions the answers themselves appear prominently in the influential data. This suggests LLMs apply a distinct, generalizable strategy to reasoning tasks rather than simply retrieving answers.
https://arxiv.org/abs/2411.12580
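
To make the idea of "tracing influential pretraining documents" concrete, here is a minimal, hypothetical sketch of gradient-based influence scoring on a toy next-token model. The model, data, and first-order dot-product score are illustrative assumptions for this sketch only; the paper's actual analysis uses influence-function techniques applied to full-scale LLMs and their pretraining corpora, which is substantially more involved.

```python
# Toy sketch (not the paper's implementation): rank candidate "pretraining documents"
# by how strongly their loss gradient aligns with the gradient of a query's loss.
# A positively aligned document would, if upweighted in training, lower the query
# loss -- i.e. it is "influential" for that query.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """Minimal next-token predictor standing in for an LLM."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # Predict each next token from the current token's embedding.
        return self.head(self.emb(tokens[:, :-1]))

def sequence_loss(model, tokens):
    logits = model(tokens)
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1)
    )

def grad_vector(model, tokens):
    """Flattened gradient of the sequence loss w.r.t. all model parameters."""
    loss = sequence_loss(model, tokens)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

model = TinyLM()
query = torch.randint(0, VOCAB, (1, 16))                      # reasoning prompt + completion
docs = [torch.randint(0, VOCAB, (1, 16)) for _ in range(5)]   # candidate pretraining documents

q_grad = grad_vector(model, query)
scores = [torch.dot(q_grad, grad_vector(model, d)).item() for d in docs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print("documents ranked by influence on the query:", ranked)
```

In this framing, the paper's finding corresponds to the highest-scoring documents for reasoning queries containing procedures (e.g., worked examples or code for a method) rather than the literal answer, while for factual queries the top documents often contain the answer itself.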