UltraFastBERT is a BERT variant that demonstrates that language models need only a small fraction of their neurons for any individual inference. The researchers show that it performs on par with comparable BERT models while using only 0.3% of its neurons during inference, engaging just 12 of the 4,095 neurons in each layer. This is achieved by replacing the regular feedforward networks with fast feedforward networks (FFFs). The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation, as well as a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference. They also publish their training code, benchmarking setup, and model weights.
https://arxiv.org/abs/2311.10770
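
To make the 0.3% figure concrete, here is a minimal sketch of how a fast feedforward layer can touch so few neurons. It assumes the neurons are arranged as a balanced binary tree and that each input evaluates only one root-to-leaf path. A depth of 11 is chosen so that the tree holds 4,095 neurons and each pass evaluates 12 of them. The function names, the tiny dimensions, the random weights, and the sign-based branching with a GELU activation are illustrative assumptions, not the authors' released code.

```python
import math
import random


def gelu(z):
    # Exact GELU written with the Gauss error function.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def make_fff(depth, dim_in, dim_out, seed=0):
    # Hypothetical helper: random input/output weights for every tree node.
    # A balanced binary tree with levels 0..depth has 2**(depth + 1) - 1 nodes.
    rng = random.Random(seed)
    n_nodes = 2 ** (depth + 1) - 1
    w_in = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim_out)] for _ in range(n_nodes)]
    return w_in, w_out


def fff_forward(x, w_in, w_out, depth):
    # Evaluate a single root-to-leaf path: depth + 1 neurons in total.
    dim_out = len(w_out[0])
    y = [0.0] * dim_out
    node = 0
    visited = []
    for _ in range(depth + 1):
        visited.append(node)
        logit = sum(w * v for w, v in zip(w_in[node], x))
        act = gelu(logit)
        for j in range(dim_out):
            y[j] += act * w_out[node][j]
        # Assumed branching rule: the sign of the pre-activation picks the child.
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return y, visited


DEPTH = 11
w_in, w_out = make_fff(DEPTH, dim_in=8, dim_out=8)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
y, visited = fff_forward(x, w_in, w_out, DEPTH)

fraction_used = len(visited) / len(w_in)
print(f"neurons evaluated: {len(visited)} of {len(w_in)} ({fraction_used:.2%})")
```

Because the path length grows with the depth of the tree while the neuron count grows exponentially with it, the fraction of neurons used per inference shrinks rapidly as the layer gets wider.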