DenseFormer is a simple modification to the standard transformer architecture (Vaswani et al., 2017) that improves perplexity without increasing model size, adding only a few thousand parameters even for large-scale models. After each transformer block, a Depth-Weighted-Average (DWA) operation computes a learned weighted average of the current and all past representations. The learned DWA weights exhibit coherent patterns of information flow, and the approach is more data-efficient: experiments show that DenseFormer matches the perplexity of much deeper transformers while being more memory-efficient and faster at inference.
https://arxiv.org/abs/2402.02622
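
To make the DWA step concrete, here is a minimal PyTorch sketch of transformer blocks interleaved with a Depth-Weighted-Average. The class name, attribute names, and initialization details are illustrative assumptions for this summary, not the authors' released implementation; only the core idea (a learned weighted average of the current and all earlier representations after each block) follows the paper.

```python
import torch
import torch.nn as nn


class DWATransformer(nn.Module):
    """Sketch: transformer blocks with Depth-Weighted-Average (DWA) in between.

    After block i, the representation becomes a learned weighted average of the
    embeddings X_0, all earlier block outputs, and block i's own output.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # After block i there are i + 2 representations to average
        # (embeddings + outputs of blocks 0..i). Put all initial weight on the
        # newest output so training starts from a plain transformer
        # (an assumed initialization for this sketch).
        weights = []
        for i in range(len(blocks)):
            w = torch.zeros(i + 2)
            w[-1] = 1.0
            weights.append(nn.Parameter(w))
        self.dwa_weights = nn.ParameterList(weights)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        reps = [x0]  # X_0: token embeddings
        x = x0
        for i, block in enumerate(self.blocks):
            reps.append(block(x))      # output of block i given X_{i-1}
            w = self.dwa_weights[i]
            # DWA: weighted average over every representation seen so far.
            x = sum(w_j * r for w_j, r in zip(w, reps))
        return x
```

With this initialization the forward pass is identical to an ordinary stack of blocks at step zero, and training only has to learn the small set of per-depth mixing weights, which is why the parameter overhead stays in the thousands.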