This study examines the relevance of $n$-gram language models in the age of large neural language models (LLMs). The authors argue that $n$-gram models remain valuable and can complement neural LLMs. They build the largest $n$-gram LM to date, trained on 5 trillion tokens, and generalize it to an $\infty$-gram LM, in which the $n$-gram order is unbounded and chosen by backing off to the longest context found in the training data. The infini-gram engine, built on suffix arrays, makes $\infty$-gram probabilities efficient to compute. The study finds that the $\infty$-gram LM achieves surprisingly high next-token prediction accuracy and can complement neural LLMs to reduce their perplexity. The analyses also shed light on deficiencies in neural LLM pretraining and in Transformer positional embeddings.
https://arxiv.org/abs/2401.17377
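
To make the backoff idea concrete, here is a minimal brute-force sketch of an $\infty$-gram estimator: find the longest suffix of the context that occurs in the corpus and return the empirical distribution over the tokens that follow it. This is only an illustration of the counting rule; the function name and toy corpus are my own, and the actual infini-gram engine uses suffix arrays rather than linear scans to get millisecond-level latency.

```python
from collections import Counter

def infty_gram_distribution(corpus_tokens, context):
    """Toy ∞-gram estimator with backoff (brute-force illustration,
    not the suffix-array-based infini-gram engine from the paper).

    Backs off from the full context to shorter suffixes until one
    occurs in the corpus, then returns the empirical distribution
    over the tokens that follow that suffix.
    """
    for start in range(len(context) + 1):
        suffix = tuple(context[start:])
        n = len(suffix)
        continuations = Counter()
        # Count which tokens follow this suffix anywhere in the corpus.
        for i in range(len(corpus_tokens) - n):
            if tuple(corpus_tokens[i:i + n]) == suffix:
                continuations[corpus_tokens[i + n]] += 1
        if continuations:
            total = sum(continuations.values())
            return {tok: c / total for tok, c in continuations.items()}
    return {}  # nothing to condition on

# Example (toy corpus): the longest matching suffix of the context is
# ("the", "cat"), so prediction is based on what follows "the cat".
corpus = "the cat sat on the mat and the cat slept".split()
print(infty_gram_distribution(corpus, ["saw", "the", "cat"]))
# {'sat': 0.5, 'slept': 0.5}
```

When the full context never appears in the corpus, the loop drops the earliest tokens one at a time, which is the backoff behavior the $\infty$-gram LM relies on instead of fixing $n$ in advance.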