Can Language Models Serve as Text-Based World Simulators?

In this study, we explore whether language models can act as world simulators in text-based environments, which would remove the need to hand-code virtual environments for complex tasks. We introduce the ByteSized32-State-Prediction benchmark to assess how accurately models like GPT-4 predict game state transitions. Although GPT-4 performs strongly, it is not yet a reliable world simulator without further advances. This research sheds light on the strengths and limitations of current language models and offers a new benchmark for tracking progress as new models emerge.
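
To make the idea of scoring state-transition prediction concrete, here is a minimal sketch. It is not the benchmark's actual data format or metric: the flat dictionary state, the `transition_accuracy` function, and the `toy_predictor` stand-in for a language model call are all hypothetical names chosen for illustration, and exact-match accuracy is assumed as the scoring rule.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical types: a game state is a flat mapping from object names to properties.
State = Dict[str, Any]
Example = Tuple[State, str, State]  # (state before, action, gold state after)


def transition_accuracy(
    examples: List[Example],
    predict: Callable[[State, str], State],
) -> float:
    """Fraction of examples whose predicted next state exactly matches the gold state."""
    if not examples:
        return 0.0
    correct = sum(
        1 for before, action, gold in examples if predict(before, action) == gold
    )
    return correct / len(examples)


def toy_predictor(state: State, action: str) -> State:
    """Stand-in for a language model call (assumption): only understands 'open door'."""
    next_state = dict(state)  # copy so the input state is left unchanged
    if action == "open door":
        next_state["door"] = "open"
    return next_state


# Two illustrative transitions; the predictor handles the first but not the second.
examples: List[Example] = [
    ({"door": "closed"}, "open door", {"door": "open"}),
    ({"door": "open", "lamp": "off"}, "turn on lamp", {"door": "open", "lamp": "on"}),
]
score = transition_accuracy(examples, toy_predictor)
```

In a real evaluation, `toy_predictor` would be replaced by a function that prompts a model with the current state and action and parses its output into the same state representation.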

https://arxiv.org/abs/2406.06485