In this paper, the authors explore the use of small language models (SLMs) deployed on resource-constrained devices to address privacy concerns. They highlight the efficiency of spiking neural networks (SNNs) and introduce Sorbet, a transformer-based spiking language model designed to be compatible with neuromorphic hardware. Sorbet replaces the energy-intensive softmax and normalization operations with shifting-based alternatives to reduce energy consumption while maintaining competitive performance. Through knowledge distillation and model quantization, Sorbet achieves a highly compressed binary-weight model. Extensive testing on the GLUE benchmark and ablation studies demonstrate Sorbet's potential as an energy-efficient solution for language model inference.
https://arxiv.org/abs/2409.15298
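
To make the shifting-based softmax idea concrete, here is a minimal numerical sketch of the general approach: approximate the exponential with powers of two so that, with integer exponents and fixed-point arithmetic, the scaling reduces to bit shifts. This is an illustrative assumption, not the paper's exact algorithm, and the function name `shift_softmax` is hypothetical.

```python
import numpy as np

def shift_softmax(logits: np.ndarray) -> np.ndarray:
    """Illustrative shift-based softmax: replaces exp() with powers of two.

    With integer exponents, 2**k can be realized as a bit shift on
    fixed-point hardware, avoiding the exponential unit. The final
    division would likewise be approximated by a shift in a hardware
    implementation; it is kept as a float division here for clarity.
    """
    # Subtract the row max for numerical stability (standard softmax trick).
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    # Round exponents to integers so 2**k maps to a left/right shift.
    k = np.floor(shifted).astype(int)
    pow2 = np.ldexp(1.0, k)  # computes 1.0 * 2**k, i.e. a shift
    return pow2 / np.sum(pow2, axis=-1, keepdims=True)

if __name__ == "__main__":
    x = np.array([2.3, 0.1, -1.7, 4.0])
    print("shift-based approx:", shift_softmax(x))
    ref = np.exp(x - x.max())
    print("reference softmax: ", ref / ref.sum())
```

The sketch illustrates why such a substitution saves energy: exponentials and dividers are expensive on neuromorphic or fixed-point hardware, while shifts and additions are cheap, at the cost of a coarser probability distribution.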