Think Before You Speak: Training Language Models with Pause Tokens

This page summarizes a paper proposing that language models be allowed to manipulate additional hidden vectors before committing to a response. The authors introduce a learnable "pause" token that is appended, in repeated copies, to the input prefix; extraction of the model's outputs is delayed until the last pause token has been processed, so the model performs extra computation before it must produce an answer. They evaluate the technique empirically on decoder-only models and find that this inference-time delay yields gains, particularly on question-answering and reasoning tasks, when the model has been trained with pause tokens. The authors conclude by framing delayed next-token prediction as a potential new paradigm for language modeling research.

https://arxiv.org/abs/2310.02226
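
As a rough illustration of the inference-time recipe described above, here is a minimal sketch in Python. It assumes a Hugging Face-style causal LM interface (`model(input_ids).logits`, a tokenizer callable with `return_tensors="pt"`) and that a dedicated "<pause>" token has already been added to the tokenizer and learned during training; the names `model`, `tokenizer`, and `NUM_PAUSES` are placeholders for illustration, not the authors' code.

```python
# Hypothetical sketch of pause-token inference; not the paper's implementation.
import torch

PAUSE_TOKEN = "<pause>"   # assumed to exist in the tokenizer's vocabulary
NUM_PAUSES = 10           # number of appended pause tokens (a tunable choice)

def generate_with_pauses(model, tokenizer, prompt, max_new_tokens=50):
    # Append pause tokens to the prompt so the model gets extra forward
    # computation (more hidden vectors to manipulate) before answering.
    padded_prompt = prompt + PAUSE_TOKEN * NUM_PAUSES
    input_ids = tokenizer(padded_prompt, return_tensors="pt").input_ids

    generated = input_ids
    for _ in range(max_new_tokens):
        logits = model(generated).logits            # [1, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy decoding for brevity
        # The first "real" prediction is extracted only after the last pause
        # token, i.e. it is conditioned on the pause-extended prefix.
        generated = torch.cat([generated, next_id.unsqueeze(-1)], dim=-1)

    # Strip the prompt (including the pause tokens) before returning the answer.
    return tokenizer.decode(generated[0, input_ids.shape[1]:])
```

In this sketch the delay is realized simply by lengthening the prefix; the paper's gains depend on the model having seen pause tokens during training, so applying this padding to an off-the-shelf model would not be expected to help on its own.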