In this post, the author examines how large language models (LLMs) can be run locally on a range of hardware by rewriting the LLaMa inference code in raw C++ and optimizing it. The surprising results: a 7B-parameter model runs locally on a Pixel 5 at 1 token/s, on an M2 MacBook Pro at ~16 tokens/s, and even on a 4 GB RAM Raspberry Pi at 0.1 tokens/s. The author works through the math of inference requirements, showing that memory bandwidth is the bottleneck, and explains how quantization makes this feasible by shrinking the amount of memory that must be streamed per token. Overall, the post stresses the value of reducing memory requirements, pointing to distillation and to training smaller models for longer.
https://finbarr.ca/how-is-llama-cpp-possible/
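To make the bandwidth argument concrete, here is a minimal back-of-envelope sketch (not code from the post). It assumes every generated token requires streaming all model weights from memory once, so throughput is bounded by bandwidth divided by model size in bytes; the bandwidth figures (~100 GB/s for the M2, ~4 GB/s for a Raspberry Pi 4) and byte widths are illustrative assumptions, not measured values.

```python
# Back-of-envelope ceiling on decoding speed from memory bandwidth alone.
# Assumption: each token generated must read every weight from memory once,
# so tokens/s <= bandwidth / (n_params * bytes_per_param).

def max_tokens_per_s(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s imposed by memory bandwidth."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

N = 7e9  # 7B-parameter model

# Assumed bandwidths (illustrative): M2 MacBook Pro ~100 GB/s, Raspberry Pi 4 ~4 GB/s.
for label, bandwidth in [("M2 MacBook Pro (~100 GB/s, assumed)", 100),
                         ("Raspberry Pi 4 (~4 GB/s, assumed)", 4)]:
    print(label)
    print(f"  fp16 (2 bytes/param) : {max_tokens_per_s(N, 2.0, bandwidth):5.1f} tokens/s upper bound")
    print(f"  4-bit (0.5 bytes/param): {max_tokens_per_s(N, 0.5, bandwidth):5.1f} tokens/s upper bound")
```

Under these assumptions, the fp16 ceiling on an M2-class machine is roughly 7 tokens/s, while 4-bit quantization raises it to roughly 29 tokens/s, which is consistent with the ~16 tokens/s figure the author reports; the Raspberry Pi numbers land in the sub-token-per-second range, again in line with the post's 0.1 tokens/s observation.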