Language models can benefit from filler tokens (e.g., a run of '.' characters in place of a chain of thought) when solving hard algorithmic tasks, even though they do not use them in a human-like way. Learning to use filler tokens effectively does not come for free, though: it requires dense, task-specific supervision. The study's key finding is that additional tokens can provide a computational advantage regardless of their content, which raises a concern that large language models could perform hidden, unauditable computation detached from the chain-of-thought tokens we actually observe. More broadly, the work shows that the choice of tokens a model conditions on affects task performance, and that exploiting content-free tokens takes deliberate supervision during training.
Paper: "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models", https://arxiv.org/abs/2404.15758
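To make the setup concrete, here is a minimal sketch of the three prompt conditions the paper contrasts (immediate answer, chain of thought, and filler tokens). The task encoding, the `ANS:` marker, and the filler length are illustrative placeholders, not the paper's exact format:

```python
def make_prompt(numbers: list[int], mode: str, filler_len: int = 32) -> str:
    """Format a toy 3SUM-style instance (does any triple sum to 0 mod 10?)."""
    instance = " ".join(str(n) for n in numbers)
    if mode == "immediate":
        # Model must answer right after the input: no extra token positions
        # in which to compute.
        return f"{instance} ANS:"
    if mode == "filler":
        # Content-free filler tokens ('.') occupy the positions where a
        # chain of thought would otherwise go; any benefit must come from
        # the extra token positions themselves, not from their content.
        return f"{instance} {'.' * filler_len} ANS:"
    if mode == "cot":
        # A worked-out intermediate computation would go here; the paper
        # uses task-specific decompositions, elided in this sketch.
        return f"{instance} <reasoning tokens> ANS:"
    raise ValueError(f"unknown mode: {mode}")

print(make_prompt([3, 5, 2, 7], "filler", filler_len=8))
# -> "3 5 2 7 ........ ANS:"
```

Note that a stock pretrained model would not benefit from the filler condition out of the box; per the paper's finding, the model has to be trained with dense supervision before the filler positions carry useful computation.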