FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences. The current release provides BF16 support and a paged KV cache with a block size of 64, and installs as a standard Python package. On an H800 SXM5 with CUDA 12.6, it reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in computation-bound configurations. Requirements are a Hopper GPU, CUDA 12.3 or later, and PyTorch 2.0 or later. The project is inspired by FlashAttention 2 & 3 and the CUTLASS project. A usage sketch follows the link below.
https://github.com/deepseek-ai/FlashMLA
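
A minimal usage sketch, assuming the get_mla_metadata and flash_mla_with_kvcache entry points documented in the repository; the tensor shapes and sizes below are illustrative placeholders rather than requirements of the kernel, and running it requires a Hopper GPU with the package installed.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative sizes (assumptions): batch, query length, query heads, KV heads,
# MLA head dims, and the paged-cache block size of 64 noted above.
b, s_q, h_q, h_kv = 16, 1, 128, 1
d, dv, block_size = 576, 512, 64
max_blocks = 32  # blocks reserved per sequence in this sketch

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(b * max_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * max_blocks, dtype=torch.int32,
                           device="cuda").view(b, max_blocks)
cache_seqlens = torch.full((b,), 1024, dtype=torch.int32, device="cuda")

# Split-scheduling metadata is computed once per batch from the cached
# sequence lengths and the query-to-KV head ratio.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

# Each decoding step then attends against the paged BF16 KV cache.
out, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```

In a real serving loop, the metadata call is made once per decoding step and the attention call is repeated per layer with that layer's cache; the block table is what lets the kernel gather KV blocks scattered across the paged cache.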