In this study, we explore the combination of model compression and efficient attention mechanisms in language models. We evaluate the impact of knowledge distillation on efficient-attention transformers and compare their performance to that of full-attention models. Additionally, we introduce a new dataset, GONERD, for evaluating the performance of Named Entity Recognition (NER) models on long sequences. Our findings indicate that distilled efficient-attention transformers can retain much of the original model’s performance while reducing inference times. We conclude that knowledge distillation is an effective method for obtaining high-performing efficient-attention models at lower cost.
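For readers unfamiliar with the technique, the sketch below shows one common formulation of a knowledge-distillation objective: the student is trained against the teacher's softened output distribution in addition to the ground-truth labels. This is a generic illustration, not the paper's exact training setup; the function name, `temperature`, and `alpha` values are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic KD loss sketch: soft-target KL (teacher -> student) plus
    hard-label cross-entropy. Hyperparameters are illustrative, not taken
    from the paper."""
    # Soften both distributions with the temperature before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients are comparable to the CE term.
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In a distillation run, `teacher_logits` would come from a frozen full-attention model and `student_logits` from the smaller or efficient-attention student being trained.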
https://arxiv.org/abs/2311.13657