Efficient Streaming Language Models with Attention Sinks

Efficient Streaming Language Models with Attention Sinks addresses the challenge of deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where the input can grow far beyond the model's training length and the key-value (KV) cache would otherwise grow without bound. The paper introduces StreamingLLM, an efficient framework that enables LLMs trained with a finite attention window to generalize to effectively infinite sequence lengths without fine-tuning. The key observation is that LLMs allocate a disproportionate amount of attention to the first few tokens regardless of their content; keeping the KV states of these "attention sinks" in the cache, alongside a sliding window of the most recent tokens, lets the model generate coherent text indefinitely without a cache reset, while evicting them degrades quality sharply. The paper also shows that adding a dedicated placeholder token as an attention sink during pre-training further improves streaming deployment. In streaming settings, StreamingLLM outperforms the sliding-window-with-recomputation baseline by up to a 22.2× speedup.

It is important to note that StreamingLLM only attends to the retained sink tokens and the most recent tokens: it does not expand the LLMs' context window or enhance their long-term memory. It is suited to scenarios that require continuous operation without extensive memory use or dependence on distant past data, and it can be combined with recent context-extension methods. Code and evaluation resources are available in the repository linked below.
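The cache-eviction policy at the heart of the method is simple enough to sketch. The following is a minimal, hypothetical Python illustration of the idea, assuming per-token KV entries stored in a plain list; the names (SinkKVCache, n_sink, window) are illustrative and not taken from the streaming-llm repository, and the real implementation also re-assigns positional encodings relative to positions within the cache rather than the original text, which is omitted here.

```python
class SinkKVCache:
    """Bounded KV cache: keep the first n_sink tokens (attention sinks)
    plus a rolling window of the most recent tokens, evicting the middle."""

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink    # initial tokens retained as attention sinks
        self.window = window    # budget for the most recent tokens
        self.entries = []       # one (key, value) entry per token

    def append(self, kv) -> None:
        self.entries.append(kv)
        # Once the budget is exceeded, drop tokens from the middle:
        # the sinks stay, the recent window stays, everything else goes.
        if len(self.entries) > self.n_sink + self.window:
            self.entries = self.entries[: self.n_sink] + self.entries[-self.window :]


# Usage: the cache stays bounded no matter how long the stream runs.
cache = SinkKVCache(n_sink=4, window=8)
for t in range(100):
    cache.append((f"k{t}", f"v{t}"))
assert len(cache.entries) == 12  # 4 sink tokens + 8 most recent tokens
```

Because eviction is a constant-time slice rather than a recomputation of the window, per-token cost stays flat as the stream grows, which is where the speedup over the recomputation baseline comes from.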

https://github.com/mit-han-lab/streaming-llm
