How to make LLMs go fast

The author begins by explaining why naive autoregressive generation in large language models (LLMs) is slow, tracing the problem to both algorithmic and hardware causes. Algorithmically, each decoding step reprocesses every token produced so far, so total work grows quadratically with sequence length. On the hardware side, LLM weights are far too large to fit in cache, so each step is dominated by streaming them in from RAM. The author then surveys ways to speed inference up: better hardware utilization, clever decoding tricks, batching and continuous batching, smaller floating-point formats such as fp16 and bfloat16, and KV caching to avoid recomputing attention keys and values for earlier tokens. Overall, the piece is a comprehensive overview of the challenges and techniques involved in speeding up LLM inference, with nothing especially controversial or surprising.
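To make the quadratic-cost point concrete, here is a minimal sketch (not the article's code) contrasting a naive decode loop with a cache-style one. The functions `toy_model`, `generate_naive`, and `generate_with_cache` are hypothetical stand-ins: a running sum plays the role that per-layer key/value tensors play in a real transformer.

```python
def toy_model(tokens: list[int]) -> int:
    """Pretend forward pass: cost scales with len(tokens)."""
    # A real model attends over all `tokens` to predict the next one.
    return (sum(tokens) + len(tokens)) % 50_000  # fake "next token"

def generate_naive(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        # Re-processes the *entire* sequence every step:
        # step i costs O(len(prompt) + i), so n_new steps cost O(n_new^2).
        tokens.append(toy_model(tokens))
    return tokens

def generate_with_cache(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    # In a real transformer the cache holds per-layer key/value tensors;
    # here a running sum stands in for that reusable state.
    cache = sum(tokens)
    for _ in range(n_new):
        # Only the newest token is processed; prior work lives in `cache`.
        next_tok = (cache + len(tokens)) % 50_000
        tokens.append(next_tok)
        cache += next_tok
    return tokens

if __name__ == "__main__":
    out = generate_naive([1, 2, 3], n_new=5)
    assert out == generate_with_cache([1, 2, 3], n_new=5)  # same tokens, linear cost
    print(out)
```

The cached version produces the same tokens but does a constant amount of work per step, which is the intuition behind KV caching in the article.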

https://vgel.me/posts/faster-inference/
