The author expresses frustration with the literature on attention and admits to not yet understanding the topic well, while noting that understanding it is necessary to stay relevant in AI. Their central observation is that attention in neural networks is essentially a form of kernel smoothing, a technique that has been in use for decades; even so, they are impressed by the engineering achievements of modern large language models. The notebook also touches on identification failures in attention weights and on the use of multi-headed attention in transformers, but the author admits to lacking the energy to explain transformers in full detail.
http://bactra.org/notebooks/nn-attention-and-transformers.html
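To make the kernel-smoothing reading concrete, here is a minimal illustrative sketch (not from the notebook itself; the function names and toy data are my own). It shows that single-query dot-product attention is a Nadaraya-Watson kernel smoother whose kernel is the exponential of a scaled dot product:

```python
import numpy as np

def kernel_smoother(query, keys, values, kernel):
    # Nadaraya-Watson smoother: a weighted average of the values,
    # with weights given by a kernel comparing the query to each key.
    w = np.array([kernel(query, k) for k in keys])
    w = w / w.sum()
    return w @ values

def softmax_attention(query, keys, values):
    # Single-query dot-product attention: the same smoother with an
    # exponential kernel on scaled dot products (i.e. a softmax).
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)
    w = np.exp(scores - scores.max())   # subtract max for numerical stability
    w = w / w.sum()
    return w @ values

# Toy example (hypothetical data, just to check the two agree).
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 3))
query = rng.normal(size=4)

exp_kernel = lambda q, k: np.exp(q @ k / np.sqrt(len(q)))
assert np.allclose(
    kernel_smoother(query, keys, values, exp_kernel),
    softmax_attention(query, keys, values),
)  # identical outputs: two vocabularies for one computation
```

In a transformer the queries, keys, and values are learned linear projections of the input, but the per-query computation is exactly this weighted average.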