This post walks through building an LLM inference engine from scratch in C++ and CUDA, with no external libraries, to shed light on the full stack of LLM inference. Inference compute matters more and more as AI models scale and are increasingly deployed locally on edge devices. The goal is a program that loads the weights of common open models and performs single-batch inference on a CPU + GPU server, iteratively improving its token throughput. Along the way, the post covers LLM architectures, the mechanics of inference, its bottlenecks, and benchmarks, all tied together by the central theme of optimizing for memory bandwidth. The post acknowledges the works that inspired the project, and the source code is available on GitHub.
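As a rough back-of-the-envelope illustration of why memory bandwidth dominates single-batch inference: each generated token requires streaming every weight from memory, so throughput is bounded above by bandwidth divided by model size. The sketch below uses hypothetical numbers (an FP16 model of 8B parameters on a device with ~900 GB/s of bandwidth), not figures from the post.

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers for illustration only.
    const double params = 8e9;            // model parameters
    const double bytes_per_param = 2.0;   // FP16 weights
    const double bandwidth_gbps = 900.0;  // device memory bandwidth, GB/s

    // Single-batch decoding must read every weight once per token,
    // so memory bandwidth sets a hard ceiling on token throughput.
    double model_bytes = params * bytes_per_param;               // ~16 GB of weights
    double tokens_per_sec = bandwidth_gbps * 1e9 / model_bytes;  // ~56 tokens/s

    std::printf("Upper bound: ~%.1f tokens/s\n", tokens_per_sec);
    return 0;
}
```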
https://andrewkchan.dev/posts/yalm.html