A Theory on Adam Instability in Large-Scale Machine Learning

In this paper, the authors propose a theory to explain the divergent behavior observed when training large language models. They attribute the phenomenon to the Adam optimization algorithm, which can enter a state in which the parameter update vector has a large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, causing training to diverge. This failure mode is more likely when training deep models with large batch sizes, the typical setting for large-scale language model training. The authors support their theory with observations from training runs of language models at different scales.

https://arxiv.org/abs/2304.09871
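
Below is a minimal numerical sketch, not taken from the paper, of one ingredient adjacent to its argument: the Adam update m_t / (sqrt(v_t) + eps) saturates in each coordinate once the second-moment estimate v_t has collapsed after a long stretch of near-zero gradients, so the next nonzero gradient triggers a large-norm, essentially sign-like step whose direction need not track the descent direction. The `adam_update` helper, the hyperparameters, and the toy gradient sequence are illustrative assumptions, not the authors' code.

```python
import numpy as np

def adam_update(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a flat parameter vector (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m / (np.sqrt(v) + eps), m, v

rng = np.random.default_rng(0)
dim = 1_000
m, v = np.zeros(dim), np.zeros(dim)

# Warm-up: gradients of order one, so the moment estimates carry ordinary-scale history.
for _ in range(200):
    _, m, v = adam_update(rng.standard_normal(dim), m, v)

# Long stretch of near-zero gradients: m decays quickly (beta1 = 0.9) and v slowly
# (beta2 = 0.999), but after enough steps both collapse toward zero.
zeros = np.zeros(dim)
for _ in range(50_000):
    _, m, v = adam_update(zeros, m, v)

# When a gradient reappears, each update component saturates near
# +/- (1 - beta1) / sqrt(1 - beta2) ~= 3.16, independent of the gradient's magnitude:
# a large-norm, sign-like step rather than one scaled to the gradient.
print("predicted saturation:", 0.1 / np.sqrt(0.001))
for scale in (1e-3, 1e-1, 10.0):
    update, _, _ = adam_update(scale * rng.standard_normal(dim), m, v)
    print(f"gradient scale {scale:>6}: mean |update component| = {np.abs(update).mean():.3f}")
```

Running this prints roughly the same mean update magnitude (about 3.16) for gradient scales spanning four orders of magnitude, which is the kind of magnitude-insensitive, direction-poor step the summary above refers to; the paper's full argument also involves the time-domain correlation of gradient components, which this toy does not model.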
