DenseFormer: Enhancing Information Flow in Transformers

DenseFormer is a simple modification to the standard transformer architecture (Vaswani et al., 2017) that improves perplexity without increasing model size, adding only a few thousand parameters even at large scale. After each transformer block, a Depth-Weighted-Average (DWA) operation computes a learned weighted average of the current block's output and all past representations, which leads to more efficient use of the data and reveals coherent patterns of information flow across depth. Experiments show that DenseFormer matches the perplexity of deeper transformers while being more memory-efficient and faster at inference.
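As a rough illustration of the DWA idea, here is a minimal PyTorch sketch. The module name `DepthWeightedAverage`, the per-block scalar weights, and the forward-loop usage are assumptions made for illustration; see the paper for the exact formulation.

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Sketch of a DWA step applied after transformer block i: mixes the
    current block output with all earlier representations (including the
    embedded input) using learned scalar weights."""

    def __init__(self, block_index: int):
        super().__init__()
        # One scalar weight per representation 0..block_index. Initialized
        # so the module starts as the identity: weight 1 on the current
        # block output, 0 on all earlier representations.
        init = torch.zeros(block_index + 1)
        init[-1] = 1.0
        self.alpha = nn.Parameter(init)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history = [X_0 (embeddings), Y_1, ..., Y_{i-1}, X_i (current block output)],
        # each of shape (batch, seq_len, d_model).
        stacked = torch.stack(history, dim=0)        # (i+1, B, T, D)
        weights = self.alpha.view(-1, 1, 1, 1)       # broadcast over B, T, D
        return (weights * stacked).sum(dim=0)        # (B, T, D)


# Hypothetical use inside a DenseFormer-style forward pass:
#   history = [embedded_input]
#   for i, block in enumerate(blocks, start=1):
#       history.append(block(history[-1]))           # raw block output X_i
#       history[-1] = dwa_modules[i - 1](history)    # replace with DWA output Y_i
```

Because only a handful of scalars are added per block, the parameter overhead stays in the thousands even for large models, which is consistent with the efficiency claim above.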

https://arxiv.org/abs/2402.02622
