Small language models such as GPT-Neo (small) or GPT-2 small, with around 125M parameters, rarely produce coherent, fluent English, which raises the question of whether large scale and complex architectures are prerequisites for fluent text generation. TinyStories, a synthetic dataset of short stories written in the vocabulary of a typical 3- to 4-year-old and generated by GPT-3.5 and GPT-4, shows that models far smaller than the current state of the art can still produce diverse, grammatically correct stories that stay consistent and exhibit basic reasoning. The paper also introduces an evaluation framework in which GPT-4 grades model output as if it were stories written by students, scoring dimensions such as grammar, creativity, and consistency; this multidimensional grading sidesteps the limitations of standard benchmarks, which typically require structured output. The authors hope TinyStories will facilitate LM development and research in low-resource or specialized domains, and shed light on how language capabilities emerge.
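As a rough illustration of the GPT-4-as-grader idea, the sketch below asks a grader model to score one generated story along a few dimensions. It assumes the OpenAI Python client; the prompt wording, model name, scoring dimensions, and response parsing are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a GPT-4-as-grader evaluation loop.
# Assumptions: the `openai` Python client is installed and OPENAI_API_KEY is
# set in the environment; the prompt, dimensions, and parsing below are
# illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

GRADING_PROMPT = """The following story was written by a student. Grade it
as a teacher would, giving an integer score from 1 to 10 for each of:
grammar, creativity, and consistency. Reply with three lines in the form
"dimension: score".

Story:
{story}
"""


def grade_story(story: str, model: str = "gpt-4") -> dict[str, int]:
    """Ask the grader model to score one generated story."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": GRADING_PROMPT.format(story=story)}
        ],
        temperature=0,  # deterministic grading
    )
    scores: dict[str, int] = {}
    for line in response.choices[0].message.content.splitlines():
        if ":" in line:
            dimension, _, value = line.partition(":")
            try:
                scores[dimension.strip().lower()] = int(value.strip())
            except ValueError:
                pass  # skip lines that are not in "dimension: score" form
    return scores


if __name__ == "__main__":
    sample = "Once upon a time, a little fox found a shiny red ball in the park."
    print(grade_story(sample))  # e.g. {'grammar': 9, 'creativity': 6, 'consistency': 8}
```

Averaging such per-dimension scores over many sampled stories gives a fluency and consistency signal that simple accuracy-style benchmarks cannot.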
https://arxiv.org/abs/2305.07759