StreamingLLM: Efficient streaming technique enables infinite sequence lengths

In this paper, the authors address the challenges of deploying Large Language Models (LLMs) in streaming applications that involve long interactions. They highlight two main issues: memory consumption during the decoding stage and the inability of popular LLMs to generalize to texts longer than the training sequence length. The authors propose a solution called StreamingLLM, which uses a window attention approach but also introduces the concept of an attention sink. This attention sink, realized by keeping the Key and Value states of the initial tokens, significantly improves the performance of window attention. The StreamingLLM framework enables LLMs to generalize to infinite sequence lengths without requiring fine-tuning. The authors demonstrate the effectiveness of StreamingLLM on various language modeling tasks and show that it outperforms the sliding-window-with-recomputation baseline with up to a 22.2× speedup in streaming settings.
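The core mechanism is a simple KV-cache eviction policy: retain the Key/Value states of a handful of initial "sink" tokens (the paper finds that about four suffice) plus a rolling window of the most recent tokens, and drop everything in between. Below is a minimal sketch of that policy, not the authors' actual implementation; the function name, default sizes, and the [batch, heads, seq, head_dim] cache layout are illustrative assumptions.

```python
import torch

def evict_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   n_sink: int = 4,
                   window: int = 1020) -> tuple[torch.Tensor, torch.Tensor]:
    """Trim a KV cache shaped [batch, heads, seq, head_dim]:
    keep `n_sink` initial attention-sink tokens plus the most
    recent `window` tokens, evicting the middle of the sequence."""
    seq_len = keys.size(2)
    if seq_len <= n_sink + window:
        # Cache still fits within budget; nothing to evict.
        return keys, values
    # Concatenate the sink tokens with the most recent window.
    keys = torch.cat([keys[:, :, :n_sink], keys[:, :, -window:]], dim=2)
    values = torch.cat([values[:, :, :n_sink], values[:, :, -window:]], dim=2)
    return keys, values
```

One detail the paper notes: positional information is assigned according to a token's position within the trimmed cache rather than its position in the original text, so the model always sees a contiguous range of positions despite the eviction.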

https://arxiv.org/abs/2309.17453
