Transformers have become the dominant architecture for foundation models thanks to their strong performance, but scaling them is costly because architectural changes typically require retraining from scratch. TokenFormer addresses this by using the attention mechanism not only for interactions among input tokens but also for interactions between tokens and model parameters. Treating model parameters as tokens and replacing the linear projections with token-parameter attention layers allows the model to be scaled progressively, without retraining from scratch, while matching the performance of Transformers trained from scratch at substantially lower training cost. The authors scale TokenFormer incrementally from 124M to 1.4B parameters to demonstrate this efficiency and flexibility. Code and models are publicly available.
https://arxiv.org/abs/2410.23168
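
For a concrete picture of the core idea, here is a minimal PyTorch sketch of a token-parameter attention layer in the spirit of the paper. The names (PattentionSketch, num_param_tokens, grow) are illustrative rather than taken from the official code, and the paper's modified softmax-style normalization is replaced with a plain softmax for brevity.

```python
# Minimal sketch of a token-parameter attention layer ("Pattention"),
# illustrating the TokenFormer idea. Names are hypothetical, not from the
# official repository; the paper's modified softmax-style normalization is
# simplified to a plain softmax here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PattentionSketch(nn.Module):
    """Replaces a d_in -> d_out linear projection with attention over
    learnable key/value parameter tokens."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens acting as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in); input tokens act as queries.
        scores = x @ self.key_params.t()          # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores, dim=-1)       # simplified normalization
        return weights @ self.value_params        # (batch, seq_len, d_out)

    @torch.no_grad()
    def grow(self, extra_tokens: int):
        # Progressive scaling: append new parameter tokens, zero-initialized.
        # With the paper's normalization, zero scores contribute zero weight,
        # so the layer's function is preserved exactly; with the plain softmax
        # used here the preservation is only approximate.
        d_in = self.key_params.shape[1]
        d_out = self.value_params.shape[1]
        device = self.key_params.device
        self.key_params = nn.Parameter(
            torch.cat([self.key_params, torch.zeros(extra_tokens, d_in, device=device)], dim=0))
        self.value_params = nn.Parameter(
            torch.cat([self.value_params, torch.zeros(extra_tokens, d_out, device=device)], dim=0))


# Example usage with arbitrary sizes:
# layer = PattentionSketch(d_in=768, d_out=768, num_param_tokens=1024)
# y = layer(torch.randn(2, 16, 768))   # (2, 16, 768)
# layer.grow(512)                      # scale up without restarting training
```

The key point is that model capacity is now set by the number of parameter tokens rather than by fixed weight-matrix dimensions, which is what lets capacity be increased incrementally instead of retraining a larger model from scratch.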