Large language models (LLMs) predominantly rely on decoder-only transformer architectures, which must retain the keys/values of all historical tokens during inference and therefore place heavy demands on GPU memory. To address this, Anchor-based LLMs (AnLLMs) introduce an anchor-based self-attention network (AnSAN) together with an anchor-based inference strategy that compresses a sequence's information into an anchor token, so that only the anchor's keys/values need to stay in the cache. At the cost of a slight drop in accuracy, AnLLMs achieve up to 99% keys/values cache reduction and up to 3.5 times faster inference, making them attractive for resource-efficient, practical LLM deployment.
https://arxiv.org/abs/2402.07616
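
The gist of the mechanism is that later tokens attend to a previous segment's anchor token rather than to that segment's full history, so only the anchors' keys/values must remain cached. The sketch below is a minimal, hypothetical NumPy illustration of that idea; the function names, segment layout, and anchor placement are assumptions made for exposition and do not reproduce the paper's actual implementation.

```python
import numpy as np

def anchor_attention_mask(segment_ids, anchor_positions):
    """Build a boolean attention mask (True = may attend) for one flattened
    sequence composed of several segments.

    Each query token may attend to:
      * earlier tokens inside its own segment (ordinary causal attention), and
      * anchor tokens of previous segments, which stand in for the
        compressed history of those segments.
    """
    n = len(segment_ids)
    is_anchor = np.zeros(n, dtype=bool)
    is_anchor[list(anchor_positions)] = True

    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):  # causal: keys cannot lie in the future
            same_segment = segment_ids[k] == segment_ids[q]
            earlier_anchor = is_anchor[k] and segment_ids[k] < segment_ids[q]
            mask[q, k] = same_segment or earlier_anchor
    return mask


def prune_kv_cache(keys, values, anchor_positions):
    """After a segment is processed, keep only the anchor tokens'
    keys/values; the rest of the segment can be dropped from the cache."""
    idx = np.asarray(sorted(anchor_positions))
    return keys[idx], values[idx]


if __name__ == "__main__":
    # Two segments of 4 tokens each; the last token of each segment is its anchor.
    segment_ids = [0, 0, 0, 0, 1, 1, 1, 1]
    anchors = [3, 7]
    print(anchor_attention_mask(segment_ids, anchors).astype(int))

    # Toy cache: 8 cached tokens with head dimension 4.
    keys = np.random.randn(8, 4)
    values = np.random.randn(8, 4)
    k_kept, v_kept = prune_kv_cache(keys, values, anchors)
    print(k_kept.shape)  # (2, 4) -- only the anchors remain cached
```

In this toy setup the cache for a finished segment shrinks from four entries to one, which is the source of the large cache reduction and faster inference reported in the paper.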