DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

DeepGEMM is a library for efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) grouped GEMMs. It is written in CUDA and needs no compilation at install time: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module. The library currently targets NVIDIA Hopper tensor cores exclusively and uses CUDA-core two-level accumulation (promotion) to compensate for imprecise FP8 tensor core accumulation. Despite its lightweight design, its performance matches or exceeds expert-tuned libraries across various matrix shapes. Performance lags on some shapes, and optimization contributions are welcome. The small, simple codebase also makes it a useful resource for learning Hopper FP8 matrix multiplication techniques.
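To give a feel for why two-level accumulation helps, here is a toy sketch in plain Python. It is not DeepGEMM's code or API: the helper names (`round_mantissa`, `dot_low_precision`, `dot_two_level`), the 8-bit accumulator mantissa, and the promotion interval of 128 are all assumptions chosen for illustration. The idea it models is that a narrow accumulator stops absorbing small addends once the running sum grows large, while periodically promoting short partial sums into a wider accumulator keeps each partial sum in a range where the narrow format is still exact.

```python
import math


def round_mantissa(x: float, bits: int) -> float:
    """Round x to `bits` mantissa bits (toy model of a narrow accumulator)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** bits
    # Python's round() rounds ties to the nearest even integer.
    return math.ldexp(round(m * scale) / scale, e)


def dot_low_precision(a, b, bits):
    """Accumulate every product in the narrow accumulator only."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_mantissa(acc + x * y, bits)
    return acc


def dot_two_level(a, b, bits, interval):
    """Accumulate `interval` products narrowly, then promote the partial
    sum into a full-precision accumulator (hypothetical interval)."""
    total = 0.0                      # wide accumulator (Python float)
    for start in range(0, len(a), interval):
        partial = 0.0                # narrow accumulator, reset per chunk
        for x, y in zip(a[start:start + interval], b[start:start + interval]):
            partial = round_mantissa(partial + x * y, bits)
        total += partial             # promotion step
    return total


a = [1.0] * 4096
b = [1.0] * 4096
naive = dot_low_precision(a, b, bits=8)
promoted = dot_two_level(a, b, bits=8, interval=128)
print(naive, promoted)  # 256.0 4096.0 (exact answer is 4096)
```

In this toy model the single narrow accumulator stalls at 256, because adding 1 to 256 falls halfway between two representable values and rounds back down, whereas the promoted version recovers the exact result. Real hardware behaviour and the library's actual kernels differ in the details; the sketch only illustrates the general principle.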

https://github.com/deepseek-ai/DeepGEMM