This project focuses on optimizing SGEMM (single-precision general matrix multiplication) on NVIDIA GPUs using CUDA. Inspired by the work of various experts in the field, it aims to bridge the gap between educational resources and high-performance libraries such as cuBLAS and CUTLASS. The blog post covers benchmarking methodology, algorithm design, and optimization techniques such as inline PTX and double-buffering; it measures the performance of the resulting SGEMM implementation on an NVIDIA RTX 3090 and compares it against cuBLAS. It also discusses memory-layout considerations, NVIDIA's PTX (Parallel Thread Execution) instruction set, and how to benchmark code correctly on CUDA devices. The write-up targets CUDA learners and aims to deepen understanding of the high-performance kernels used in AI/ML.
https://salykova.github.io/sgemm-gpu
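
The double-buffering technique mentioned above lends itself to a short illustration. The sketch below is not the author's kernel; it is a minimal CUDA example, assuming row-major matrices with dimensions divisible by the tile size, that shows how two alternating shared-memory tiles let the global loads for the next K-tile be issued while the current tile is being computed on.

```cuda
// Minimal double-buffered tiled SGEMM sketch (C = A * B).
// Assumes row-major A (MxK), B (KxN), C (MxN) with M, N, K multiples of TILE.
#include <cuda_runtime.h>

#define TILE 32

__global__ void sgemm_double_buffered(const float* A, const float* B, float* C,
                                      int M, int N, int K) {
    // Two shared-memory buffers per operand: while one is read by the
    // inner product loop, the other is filled with the next K-tile.
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Preload the first K-tile into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    int numTiles = K / TILE;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Issue loads for the next tile before computing on the current one,
        // so the memory traffic overlaps with the FMAs below.
        if (t + 1 < numTiles) {
            int k0 = (t + 1) * TILE;
            As[nxt][threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        }
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();  // next buffer is filled; current buffer is now free
    }
    C[row * N + col] = acc;
}
```

A production kernel (like the one in the blog post) would combine this with register tiling, vectorized loads, and asynchronous copies, but the ping-pong buffer indexing shown here is the core of the double-buffering idea.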