The article surveys efforts to improve AI compute efficiency on the NVIDIA H100 GPU. New instructions like WGMMA (warpgroup matrix multiply-accumulate) are crucial for reaching peak performance, and shared memory and address generation also play key roles. ThunderKittens, an embedded DSL, simplifies kernel writing and improves hardware utilization by centering everything on small tensor tiles. The philosophy behind ThunderKittens is that small tiles match both what AI workloads need and what the hardware does well, and it suggests a shift in perspective: rather than asking what hardware suits AI, ask which AI algorithms suit the hardware. Overall, ThunderKittens aims to streamline the process of creating high-performing AI kernels. Stay tuned for ThunderKittens on AMD hardware!
https://hazyresearch.stanford.edu/blog/2024-05-12-tk