In this blog post, the author examines the implementation and computational cost of linear transformers compared to traditional transformers. Because linear transformers scale linearly rather than quadratically with context length, they are commonly expected to train faster; the author's initial experiments, however, showed them training more slowly than traditional transformers. The post walks through several implementations of linear transformers and shows that, with careful engineering, they can in fact deliver substantial speed-ups, while weighing the speed and memory trade-offs between the different algorithms. Emphasizing the importance of efficient implementation, the author explores several parallel implementations and introduces two formulations with different costs: the attention formulation, whose cost grows quadratically with context length, and the state formulation, whose cost grows linearly. The author suggests using the attention formulation for small contexts and the state formulation for large contexts, and proposes a chunked formulation that combines the benefits of both, yielding the best performance across context sizes. Finally, the author notes that further research is needed to improve how linear transformers learn, so that language models can be trained with very large context sizes.
https://manifestai.com/blogposts/faster-after-all/
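
To make the trade-off concrete, here is a minimal NumPy sketch of causal linear attention computed the three ways summarized above: the attention formulation, the state formulation, and a chunked formulation. The shapes, chunk size, and positive feature map `phi` are illustrative assumptions, not the implementation from the post.

```python
# Minimal sketch of three equivalent ways to compute causal linear attention.
# Shapes, chunk size, and the feature map are assumptions for illustration.
import numpy as np

T, d, C = 256, 64, 32          # sequence length, head dim, chunk size (assumed)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
phi = lambda x: np.maximum(x, 0.0) + 1.0   # placeholder positive feature map

def attention_form(Q, K, V):
    """Attention formulation: materialize the T x T matrix (O(T^2 d) work)."""
    A = np.tril(phi(Q) @ phi(K).T)          # causal mask via lower triangle
    return A @ V

def state_form(Q, K, V):
    """State formulation: carry a d x d state through time (O(T d^2) work)."""
    S = np.zeros((d, d))
    out = np.empty_like(V)
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])      # fold k_t v_t^T into the state
        out[t] = phi(Q[t]) @ S              # read out with the current query
    return out

def chunked_form(Q, K, V, C=C):
    """Chunked formulation: attention within chunks, state across chunks."""
    S = np.zeros((d, d))
    out = np.empty_like(V)
    for s in range(0, T, C):
        q, k, v = phi(Q[s:s+C]), phi(K[s:s+C]), V[s:s+C]
        out[s:s+C] = q @ S + np.tril(q @ k.T) @ v   # cross-chunk + in-chunk
        S += k.T @ v                                # absorb the chunk into the state
    return out

assert np.allclose(attention_form(Q, K, V), state_form(Q, K, V))
assert np.allclose(attention_form(Q, K, V), chunked_form(Q, K, V))
```

All three produce the same output; the attention form does O(T^2 d) work, the state form O(T d^2), and the chunked form interpolates between them, which is why the post favors the attention formulation for small contexts, the state formulation for large ones, and the chunked formulation as the combination of the two.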