Diff Transformer addresses the tendency of the standard Transformer to over-allocate attention to irrelevant context. Its differential attention mechanism computes attention as the difference between two separate softmax attention maps, so common-mode noise cancels and attention concentrates on relevant context; the subtraction promotes sparse attention patterns and yields better language modeling performance than the standard Transformer. Diff Transformer also offers advantages in long-context modeling, key information retrieval, and reducing activation outliers. It mitigates hallucination in question answering and text summarization, improves accuracy in in-context learning, and is more robust to permutation of the order of in-context examples (a minimal sketch of the mechanism follows below). These results position it as a promising architecture for large language models.
https://arxiv.org/abs/2410.05258
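
Below is a minimal single-head sketch of the differential attention idea described above: two softmax attention maps are computed from separate query/key projections and subtracted, weighted by a scalar `lam`. This is an illustrative simplification, not the paper's full implementation (which uses a multi-head layout, per-head normalization, and a learnable reparameterization of λ); the function and parameter names here are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    """Single-head differential attention sketch.

    x:   (n, d_model) token representations
    Wq*, Wk*: (d_model, d) projection matrices for the two attention maps
    Wv:  (d_model, d) value projection
    lam: fixed scalar weight on the second map (learnable in the paper)
    """
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv

    # Two independent softmax attention maps.
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)

    # Subtracting the maps cancels attention mass that both assign to
    # irrelevant tokens, leaving a sparser effective attention pattern.
    return (a1 - lam * a2) @ v

# Example usage with random weights (shapes chosen arbitrarily).
if __name__ == "__main__":
    n, d_model, d = 16, 64, 32
    x = torch.randn(n, d_model)
    Wq1, Wk1, Wq2, Wk2, Wv = (torch.randn(d_model, d) for _ in range(5))
    out = diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv)
    print(out.shape)  # torch.Size([16, 32])
```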