FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention

Team PyTorch introduces FlexAttention, a new PyTorch API that makes it easy to implement a wide range of attention variants. With FlexAttention, mechanisms such as Causal masking, Relative Positional Embeddings, ALiBi, Sliding Window Attention, and more can be written in just a few lines of PyTorch code. Its score_mod and mask_mod hooks let users modify attention scores and take advantage of sparsity in attention masks, resulting in significant performance improvements. By fusing each attention variant into a single efficient kernel, FlexAttention addresses the challenge of optimizing attention mechanisms in machine learning applications. Check out the Attention Gym for more examples, and contribute your own variants to this exciting new API.
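As a flavor of the API, here is a minimal sketch of how score_mod and mask_mod fit together. It assumes PyTorch 2.5+ (where flex_attention and create_block_mask live under torch.nn.attention.flex_attention) and a CUDA device; the ALiBi slope formula used here is the common 2^(-8(h+1)/H) convention, chosen for illustration rather than taken from the post itself.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim

# score_mod: add an ALiBi-style bias to each attention score.
def alibi_score_mod(score, b, h, q_idx, kv_idx):
    slope = torch.exp2(-((h + 1) * 8.0 / H))  # per-head slope (assumed convention)
    return score + (kv_idx - q_idx) * slope

# mask_mod: return True where attention is allowed (causal here).
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# B=None / H=None broadcast the mask over batch and heads.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)

q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, score_mod=alibi_score_mod, block_mask=block_mask)
```

Compiling flex_attention with torch.compile is what fuses the score and mask modifications into a single FlashAttention-style kernel; in eager mode the same code still runs, but via a slower reference path.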

https://pytorch.org/blog/flexattention/
