Attention Is Off By One

In this blog post, the author argues that the attention formula used in modern Transformer models contains an off-by-one error, and that this error complicates compressing and deploying those models. The bug, they claim, lies in the softmax function inside the attention mechanism: softmax forces every attention head to assign nonzero weight somewhere, even when a head has nothing useful to attend to. The author gives a mathematical explanation of the bug and of how it distorts the behavior of attention heads, and proposes a simple tweak, called "QuietAttention," that modifies the softmax function so a head can effectively abstain. They invite others to run experiments and collaborate on further research.
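As the post describes it, the tweak adds one to the softmax denominator, which is equivalent to including an implicit extra logit fixed at zero. A minimal sketch of that idea in NumPy (the function name `softmax1` and the stability shift are illustrative choices, not taken verbatim from the post):

```python
import numpy as np

def softmax1(x, axis=-1):
    # "Quiet" softmax: exp(x_i) / (1 + sum_j exp(x_j)).
    # Equivalent to ordinary softmax over the logits plus one
    # implicit logit pinned at zero, whose probability is discarded.
    # Shift by max(x, 0) for numerical stability; the implicit zero
    # logit must be shifted by the same amount, hence exp(-m) below.
    m = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))
```

Unlike ordinary softmax, the outputs need not sum to one: when all logits are very negative, every entry goes to zero, which is exactly the "say nothing" behavior the post wants attention heads to have.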
