Based: Simple linear attention language models

Based is an architecture that combines sliding window attention with linear attention to give a language model strong associative recall at low cost. At inference time, Based delivers up to 24x higher generation throughput than Transformers using FlashAttention-2. By trading recall ability against the size of its fixed recurrent state, it outperforms prior sub-quadratic models on recall-intensive real-world tasks such as information extraction and reading comprehension, and despite its simplicity it matches Mamba in language-modeling quality. Its speed comes from the choice of featurization (a Taylor approximation of the softmax) paired with IO-aware algorithms.
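To make the linear-attention half concrete, here is a minimal NumPy sketch (not the authors' code) of a 2nd-order Taylor feature map, where phi(q) . phi(k) ≈ 1 + q.k + (q.k)^2 / 2 approximates exp(q.k), together with the constant-size recurrent state that enables O(1)-per-token generation. The scaling factor, the epsilon in the denominator, and all names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def taylor_feature_map(x):
    # Feature map for a 2nd-order Taylor approximation of exp(s):
    # exp(s) ~ 1 + s + s^2 / 2, realized so that
    # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2.
    d = x.shape[-1]
    x = x / d ** 0.25  # keep dot products well-scaled (assumed normalization)
    ones = np.ones(x.shape[:-1] + (1,))
    # Outer-product features give the quadratic term; 1/sqrt(2) matches s^2/2.
    second = np.einsum("...i,...j->...ij", x, x).reshape(
        x.shape[:-1] + (d * d,)) / np.sqrt(2)
    return np.concatenate([ones, x, second], axis=-1)

def linear_attention_recurrent(q, k, v):
    # q, k: (T, d); v: (T, d_v). The state (S, z) has fixed size,
    # so each generated token costs the same regardless of context length.
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)
    d_f, d_v = fq.shape[-1], v.shape[-1]
    S = np.zeros((d_f, d_v))   # accumulates phi(k_i) v_i^T
    z = np.zeros(d_f)          # accumulates phi(k_i) for normalization
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S += np.outer(fk[t], v[t])
        z += fk[t]
        out[t] = fq[t] @ S / (fq[t] @ z + 1e-6)  # epsilon avoids divide-by-zero
    return out

# Example usage:
T, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
y = linear_attention_recurrent(q, k, v)  # (T, d) causal attention outputs
```

Because the state `(S, z)` never grows with sequence length, decoding avoids the KV-cache memory traffic that dominates Transformer inference; in Based, layers like this are interleaved with small sliding-window softmax attention, which handles precise local token recall that the low-rank linear state cannot.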

https://www.together.ai/blog/based
