In this paper, we introduce Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to reduce memory overhead in language models. TPA factorizes representations into low-rank components and integrates with RoPE, improving both model quality and memory efficiency. Building on TPA, we present the Tensor ProducT ATTenTion Transformer (T6), which outperforms Transformer baselines that use MHA, MQA, GQA, and MLA on language modeling tasks. TPA’s memory efficiency allows longer sequences to be processed under fixed resource constraints, addressing a key scalability issue in modern language models.
https://arxiv.org/abs/2501.06425
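
To make the idea of factorizing representations into low-rank components concrete, the following is a minimal sketch and not the implementation from the paper. It assumes that a token's per-head keys (an `h x d_h` matrix) are rebuilt as the average of `rank` outer products between a head-dimension factor and a feature-dimension factor, so that only the factors need to be stored. The function names, the averaging convention, and the sizes `h = 8`, `d_h = 64`, `rank = 2` are all hypothetical choices made for illustration.

```python
import random


def reconstruct(a_factors, b_factors):
    """Average of R outer products a_r (length h) x b_r (length d_h).

    Returns an h x d_h matrix as a list of lists. The averaging over
    the rank is an assumption of this sketch.
    """
    rank = len(a_factors)
    h, d_h = len(a_factors[0]), len(b_factors[0])
    out = [[0.0] * d_h for _ in range(h)]
    for a, b in zip(a_factors, b_factors):
        for i in range(h):
            for j in range(d_h):
                out[i][j] += a[i] * b[j] / rank
    return out


def cached_floats_per_token(h, d_h, rank=None):
    """Numbers stored per token for one of K or V.

    Full storage keeps h * d_h values; factorized storage keeps
    rank * (h + d_h) values (the factors only).
    """
    if rank is None:
        return h * d_h
    return rank * (h + d_h)


# Hypothetical sizes, chosen only for illustration.
h, d_h, rank = 8, 64, 2
rng = random.Random(0)
a_factors = [[rng.gauss(0, 1) for _ in range(h)] for _ in range(rank)]
b_factors = [[rng.gauss(0, 1) for _ in range(d_h)] for _ in range(rank)]

keys = reconstruct(a_factors, b_factors)      # 8 x 64 matrix rebuilt from factors
full = cached_floats_per_token(h, d_h)        # 8 * 64 = 512
factored = cached_floats_per_token(h, d_h, rank)  # 2 * (8 + 64) = 144
print(len(keys), len(keys[0]), full, factored)  # prints: 8 64 512 144
```

Under these assumed sizes, storing the factors takes 144 numbers per token instead of 512. The saving grows as the rank shrinks relative to `h` and `d_h`, which is the kind of trade-off that lets longer sequences fit in a fixed memory budget.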