The authors address the limitations of Transformers on long sequences: quadratic attention complexity and weak length extrapolation. They introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the Mega architecture (exponential moving average with gated attention) and adds several new components: the complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and a pre-norm with two-hop residual configuration. In a controlled comparison with Llama2 at the scale of 7 billion parameters and 2 trillion training tokens, Megalodon achieves better training efficiency than the Transformer baseline, with a training loss falling between those of Llama2-7B and Llama2-13B. The paper also reports improvements on downstream tasks over same-scale Transformer baselines.
https://arxiv.org/abs/2404.08801
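To make the CEMA component concrete, here is a minimal NumPy sketch of a complex-damped EMA recurrence in the spirit of the paper's complex exponential moving average. The function name `cema_sketch`, the tensor shapes, and the flat `(d, h)` parameterization are illustrative assumptions; the actual Megalodon layer uses a specific angle schedule and a fused, chunk-parallel implementation not reproduced here.

```python
import numpy as np

def cema_sketch(x, alpha, delta, theta, eta, beta):
    """Simplified complex-damped EMA recurrence (CEMA-style), single sequence.

    x:     (T, d)  real input sequence
    beta:  (d, h)  expansion of each input dim into h sub-dims (assumed shape)
    alpha: (d, h)  damping factors in (0, 1)
    delta: (d, h)  decay factors in (0, 1)
    theta: (d, h)  rotation angles of the complex decay
    eta:   (d, h)  complex output projection back to d dims
    Returns a (T, d) real output sequence.
    """
    T, d = x.shape
    rot = np.exp(1j * theta)                 # complex rotation applied each step
    decay = (1.0 - alpha * delta) * rot      # complex decay coefficient
    gain = alpha * rot                       # complex input coefficient

    state = np.zeros_like(decay, dtype=np.complex128)
    y = np.zeros((T, d))
    for t in range(T):
        u = beta * x[t][:, None]             # expand input: (d,) -> (d, h)
        state = gain * u + decay * state     # complex damped EMA update
        y[t] = np.real(np.sum(np.conj(eta) * state, axis=1))  # project to real
    return y
```

With theta set to zero this reduces to the real multi-dimensional damped EMA used in Mega; the complex rotation adds an oscillatory component to the decay, which is the core idea behind CEMA.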