Scaling Transformers to 1B Tokens

In this paper, the authors address the challenge of scaling sequence length in large language models. Existing methods are limited by either computational complexity or model expressivity, which caps the maximum sequence length. To overcome these issues, they propose LongNet, a Transformer variant that can handle sequence lengths of over 1 billion tokens without compromising performance on shorter sequences. Its core component is dilated attention, which expands the attentive field exponentially as the distance between tokens grows. LongNet offers several advantages: linear computational complexity, the ability to be trained in a distributed fashion over extremely long sequences, and seamless integration with existing Transformer-based optimizations. Experimental results show that LongNet performs well on both long-sequence modeling and general language tasks. This work opens the door to modeling very long sequences, such as entire corpora or even the entire Internet.

https://arxiv.org/abs/2307.02486
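
The key idea behind dilated attention is that each segment of the sequence attends only over a strided subset of its positions, and several segment-length/dilation configurations are mixed so that the attentive field grows exponentially while the cost stays linear in sequence length. The snippet below is a minimal PyTorch sketch of a single (segment length `w`, dilation `r`) configuration; the function name, the zero pass-through for dropped positions, and the hyperparameter values are illustrative assumptions, not the paper's implementation, which combines multiple configurations with learned weighting and optimized kernels.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=8, r=2):
    """Sketch of one dilated-attention configuration: split the sequence into
    segments of length w and, within each segment, attend only over every
    r-th position. Cost is linear in sequence length for fixed w."""
    b, n, d = q.shape
    assert n % w == 0, "sequence length must be divisible by the segment length"
    # Reshape to (batch, num_segments, w, dim) and keep every r-th token per segment.
    qs = q.view(b, n // w, w, d)[:, :, ::r, :]
    ks = k.view(b, n // w, w, d)[:, :, ::r, :]
    vs = v.view(b, n // w, w, d)[:, :, ::r, :]
    # Standard scaled dot-product attention, but only over the sparse subset.
    scores = qs @ ks.transpose(-2, -1) / d ** 0.5
    out_sparse = F.softmax(scores, dim=-1) @ vs
    # Scatter the sparse outputs back to their original positions; positions
    # skipped by the dilation are left as zeros in this simplified sketch.
    out = torch.zeros_like(q).view(b, n // w, w, d)
    out[:, :, ::r, :] = out_sparse
    return out.view(b, n, d)

# Usage: in the full method, outputs from several (w, r) pairs covering short
# and long ranges would be combined, which is how the attentive field expands.
q = k = v = torch.randn(1, 16, 8)
y = dilated_attention(q, k, v, w=8, r=2)
```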