Pre-trained large language models (LLMs) compute basic arithmetic such as addition using Fourier features: dimensions in the hidden state that represent numbers through a sparse set of frequencies. The paper's analysis shows that MLP and attention layers use these features in complementary ways: MLP layers rely mainly on low-frequency features to approximate the magnitude of the answer, while attention layers rely mainly on high-frequency features to perform modular addition (e.g., determining whether the answer is even). Pre-training is essential to this mechanism: models trained from scratch exploit only low-frequency features and reach lower accuracy, but introducing pre-trained token embeddings into a randomly initialized model rescues its performance. The study highlights how appropriate pre-trained representations, such as Fourier features, enable Transformers to learn precise mechanisms for algorithmic tasks.
https://arxiv.org/abs/2406.03445
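
To make the mechanism concrete, here is a minimal numerical sketch (not code from the paper): integers are encoded as complex phases at a few assumed periods, addition becomes phase multiplication, a low-frequency phase yields an approximate magnitude (the role the paper attributes to MLP layers), and a high-frequency phase supplies the exact last digit (the modular information the paper attributes to attention layers). The specific periods, the `magnitude_error` parameter, and the decoding heuristic are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch: encode an integer n as complex phases exp(2*pi*i*n / T)
# at a few periods T. A large period acts like a low-frequency feature whose
# phase tracks magnitude; small periods act like high-frequency features that
# encode n mod T. The periods chosen below are assumptions for the demo.
LOW_PERIOD = 1000          # assumed low-frequency period covering the answer range
HIGH_PERIODS = [2, 5, 10]  # assumed high-frequency periods carrying modular info

def encode(n):
    """Represent an integer as one complex phase per period."""
    return {T: np.exp(2j * np.pi * n / T) for T in [LOW_PERIOD] + HIGH_PERIODS}

def add_in_fourier_space(fa, fb):
    """Adding two integers corresponds to multiplying their phases at each period."""
    return {T: fa[T] * fb[T] for T in fa}

def decode(feats, magnitude_error=0.0):
    """Combine an approximate magnitude with an exact modular correction."""
    two_pi = 2 * np.pi
    # Low-frequency phase -> coarse magnitude (perturbed here to mimic the
    # approximate estimate the paper attributes to MLP layers).
    approx = (np.angle(feats[LOW_PERIOD]) % two_pi) * LOW_PERIOD / two_pi + magnitude_error
    # High-frequency period-10 phase -> exact last digit, i.e. (a + b) mod 10
    # (the modular addition the paper attributes to attention layers).
    last_digit = int(round((np.angle(feats[10]) % two_pi) * 10 / two_pi)) % 10
    # Snap the coarse estimate to the nearest integer ending in that digit.
    base = int(round(approx))
    offsets = [(last_digit - base) % 10, (last_digit - base) % 10 - 10]
    return base + min(offsets, key=abs)

a, b = 127, 358
summed = add_in_fourier_space(encode(a), encode(b))
print(decode(summed))                       # 485, low-frequency estimate is exact here
print(decode(summed, magnitude_error=3.2))  # still 485: the mod-10 feature corrects the drift
```

The toy decoder mirrors the paper's qualitative story: a slightly wrong magnitude estimate is still recoverable as long as the high-frequency (modular) information pins down the correct residue, which is why models that learn only low-frequency features end up less accurate.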