This paper introduces a Bayesian learning model for analyzing the behavior of Large Language Models (LLMs), starting from their optimization objective of predicting the next token. The model is built around an ideal generative text model, represented by a multinomial transition probability matrix with a prior. Drawing on the mapping between embeddings and multinomial distributions and on the Dirichlet approximation theorem, the paper shows how LLMs approximate this matrix. It finds that text generation by LLMs aligns with Bayesian learning principles, and it offers an explanation for why in-context learning emerges in larger models. The study concludes that LLM behavior is consistent with Bayesian learning, which may open up new possibilities for applying these models.
https://arxiv.org/abs/2402.03175
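
To make the Bayesian reading concrete, here is a toy sketch of the standard Dirichlet-multinomial update for a single row of a next-token transition matrix. It is not the paper's construction: the three-token vocabulary, the uniform prior, and the names `posterior_predictive`, `prior_alpha`, and `context_tokens` are all assumptions made for illustration. The idea is that tokens observed in a prompt act as samples that shift the prior toward a posterior over the next token.

```python
def posterior_predictive(prior_alpha, counts):
    """Posterior-mean next-token probabilities under a Dirichlet prior.

    With a Dirichlet(prior_alpha) prior and multinomial observations,
    the posterior is Dirichlet(prior_alpha + counts); its mean is the
    normalized vector of updated pseudo-counts.
    """
    posterior_alpha = [a + c for a, c in zip(prior_alpha, counts)]
    total = sum(posterior_alpha)
    return [a / total for a in posterior_alpha]


# Hypothetical three-token vocabulary and a uniform prior (assumptions).
vocab = ["a", "b", "c"]
prior_alpha = [1.0, 1.0, 1.0]

# Hypothetical tokens seen after some fixed context in a prompt.
context_tokens = ["b", "b", "a", "b"]
counts = [float(context_tokens.count(tok)) for tok in vocab]

# Prediction before any evidence, and after the in-context evidence.
before = posterior_predictive(prior_alpha, [0.0, 0.0, 0.0])
after = posterior_predictive(prior_alpha, counts)

for tok, p0, p1 in zip(vocab, before, after):
    print(f"{tok}: prior {p0:.3f} -> posterior {p1:.3f}")
```

In this sketch the prediction for "b" rises from 1/3 to 4/7 after three observations, which is the sense in which in-context examples can be viewed as evidence that updates a prior.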