This paper explores the connection between the transformer architecture, which underpins modern natural language processing (NLP), and support vector machines (SVMs). The authors establish a formal equivalence between the optimization geometry of self-attention in transformers and a hard-margin SVM problem that separates optimal input tokens from non-optimal ones. They show that optimizing the attention layer with vanishing regularization converges in direction to an SVM solution that minimizes the nuclear norm of the combined key-query parameter W = KQ^T. Notably, this convergence can be toward locally optimal directions rather than globally optimal ones; over-parameterization is shown to facilitate global convergence and to ensure a benign optimization landscape. The authors also present a more general SVM equivalence for nonlinear prediction heads and discuss potential applications and open research directions. Overall, the work offers insight into the underlying mechanisms of transformers and their relationship to SVMs.
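For reference, the core attention-SVM equivalence can be sketched roughly as follows; the notation is lightly paraphrased here, and the precise formulation (including which norm applies under which parameterization) should be checked against the paper. With token matrices $X_i$ (rows $x_{i,t}$), query tokens $z_i$, and combined attention weights $W = KQ^\top$ entering the layer through $\mathrm{softmax}(X_i W z_i)$, the induced hard-margin problem takes a form like

\[
\min_{W} \; \|W\|_* \quad \text{s.t.} \quad (x_{i,\alpha_i} - x_{i,t})^\top W z_i \ge 1 \quad \text{for all } t \ne \alpha_i,\; i \in [n],
\]

where $\alpha_i$ indexes the (locally) optimal token selected from the $i$-th input sequence, so the constraints separate the selected token from all others with unit margin.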
https://arxiv.org/abs/2308.16898