In this paper, the authors address the challenge of high-throughput serving for large language models (LLMs), which requires batching many requests together. Existing systems struggle because the key-value cache (KV cache) memory for each request is large and grows and shrinks dynamically; managed inefficiently, it suffers from fragmentation and redundant duplication, which wastes memory and limits the batch size. To solve this problem, the authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging techniques in operating systems. On top of it, they build vLLM, an LLM serving system that achieves near-zero waste in KV cache memory and allows flexible sharing of the KV cache within and across requests. Evaluations show that vLLM substantially improves throughput (2-4x in the paper's experiments) over state-of-the-art serving systems at the same level of latency. The source code for vLLM is publicly available.
https://arxiv.org/abs/2309.06180
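
To make the paged KV-cache idea concrete, here is a minimal Python sketch, not the authors' implementation: the cache is split into fixed-size blocks, each request keeps a block table mapping its logical token positions to physical blocks, and blocks can be shared between requests via reference counting. Names such as BLOCK_SIZE, BlockAllocator, and Sequence are illustrative and not part of vLLM's actual API.

# Minimal sketch of a paged KV cache (illustrative only; not vLLM's code).
from dataclasses import dataclass, field
from typing import List

BLOCK_SIZE = 16  # tokens stored per physical block (assumed value)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block: int) -> None:
        # Sharing a block (e.g. a common prompt prefix) just bumps its refcount.
        self.ref_counts[block] += 1

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


@dataclass
class Sequence:
    """One request: logical token positions map to physical blocks via block_table."""
    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per request are ever left unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence()
    for _ in range(20):                  # 20 tokens -> occupies 2 blocks of 16
        seq.append_token(allocator)
    print(seq.block_table)               # physical blocks backing this request

    # Fork the sequence (e.g. for parallel sampling): reuse existing blocks.
    child = Sequence(block_table=list(seq.block_table), num_tokens=seq.num_tokens)
    for block in child.block_table:
        allocator.fork(block)

Because blocks are allocated only on demand and need not be contiguous, internal fragmentation is bounded to less than one block per request, which is what lets the serving system pack larger batches into the same GPU memory.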