Scalable MatMul-Free Language Modeling

In this paper, the authors show how matrix multiplication (MatMul) can be eliminated from large language models (LLMs) while maintaining strong performance at billion-parameter scales. Their experiments show that MatMul-free models perform on par with state-of-the-art Transformers while using significantly less memory during inference. They also provide a GPU-efficient implementation that reduces memory usage during training by up to 61%. Notably, they built a custom FPGA accelerator that processes billion-parameter-scale models at 13 W, moving LLMs closer to brain-like efficiency. The work highlights the potential of future accelerators optimized for the lightweight operations these models rely on.
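For background on how the MatMuls are removed: the paper constrains weights to the ternary values {-1, 0, +1}, so each dot product collapses into additions and subtractions. Below is a minimal NumPy sketch of that idea; `ternary_quantize` and `ternary_matmul` are illustrative names, not the paper's actual API, and the quantizer shown is a simplified stand-in.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Round weights to {-1, 0, +1} with a per-matrix scale (illustrative)."""
    scale = np.mean(np.abs(w)) + eps
    w_t = np.clip(np.round(w / scale), -1, 1)
    return w_t, scale

def ternary_matmul(x, w_t, scale):
    """Dot product with ternary weights: only adds and subtracts remain.

    NumPy matmuls are used here for brevity; because the masks are
    boolean, each output element is just a sum of selected inputs,
    which is what dedicated hardware can exploit.
    """
    pos = x @ (w_t == 1)    # add inputs where the weight is +1
    neg = x @ (w_t == -1)   # subtract inputs where the weight is -1
    return (pos - neg) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))   # batch of 2, hidden size 8
w = rng.normal(size=(8, 4))   # dense layer mapping 8 -> 4
w_t, s = ternary_quantize(w)
y_approx = ternary_matmul(x, w_t, s)
print(np.abs(y_approx - x @ w).mean())  # small quantization error
```

The full model pairs such ternary dense layers with element-wise token mixing, so no floating-point multiplications are needed in the main compute path.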

https://arxiv.org/abs/2406.02528
