Transformers without normalization layers can match or exceed the performance of their normalized counterparts when equipped with a simple and effective technique called Dynamic Tanh (DyT). The DyT layer is a drop-in replacement for LayerNorm or RMSNorm, inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. DyT is straightforward to implement in PyTorch. Across tasks ranging from vision to language modeling, Transformers with DyT achieve performance similar to or better than normalized Transformers, challenging the conventional belief that normalization layers are indispensable in modern neural networks.
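As a concrete illustration, here is a minimal PyTorch sketch of a DyT layer following the paper's description, computing y = γ · tanh(αx) + β with a learnable scalar α and per-channel affine parameters γ and β. The class name, parameter names, and the initial value of α are illustrative assumptions, not copied from the official code (linked below).

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm / RMSNorm.

    Minimal sketch: y = gamma * tanh(alpha * x) + beta, where alpha is a
    learnable scalar and gamma, beta are learnable per-channel parameters.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):  # init_alpha is an assumed default
        super().__init__()
        # Learnable scalar that scales the input before the tanh squashing.
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)
        # Per-channel affine parameters, mirroring LayerNorm's weight and bias.
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise squashing replaces the statistics-based normalization.
        x = torch.tanh(self.alpha * x)
        return self.gamma * x + self.beta
```

In practice, one would swap `nn.LayerNorm(dim)` for `DyT(dim)` wherever a normalization layer appears in the Transformer block; no activation statistics are computed, so the layer is purely element-wise.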
https://jiachenzhu.github.io/DyT/