In this paper, the authors address the challenges of deploying large language models (LLMs) on resource-constrained embedded devices. They present an FPGA-based accelerator that applies post-training quantization to reduce model size and make better use of limited off-chip memory bandwidth, thereby improving LLM inference performance. The design combines asynchronous computation with a fully pipelined accelerator for matrix-vector multiplication. Experiments with the TinyLlama 1.1B model on a Xilinx ZCU102 platform show a 14.3-15.8x speedup and a 6.1x improvement in power efficiency over running inference solely on the ZCU102 processing system.
https://arxiv.org/abs/2409.11424
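
For context, below is a minimal sketch of the kind of quantized matrix-vector product such an accelerator pipelines, assuming per-row symmetric int8 post-training quantization with a float scale per row. The function name, layout, and quantization granularity are illustrative assumptions, not the paper's actual scheme or dataflow.

```cpp
// Illustrative sketch (not the paper's implementation): per-row symmetric
// int8 quantized matrix-vector product, the core kernel an LLM accelerator
// pipelines. Assumes row-major weights and one float scale per row.
#include <cstdint>
#include <vector>

std::vector<float> quantized_matvec(const std::vector<int8_t>& w_q,   // rows x cols, row-major int8 weights
                                    const std::vector<float>& scale,  // one dequantization scale per row
                                    const std::vector<float>& x,      // activation vector of length cols
                                    std::size_t rows, std::size_t cols) {
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        // On an FPGA, this inner loop is the part that gets fully pipelined:
        // weights stream from off-chip memory and one MAC issues per cycle.
        for (std::size_t c = 0; c < cols; ++c) {
            acc += static_cast<float>(w_q[r * cols + c]) * x[c];
        }
        y[r] = acc * scale[r];  // dequantize the accumulated dot product
    }
    return y;
}
```

Storing weights as int8 rather than float halves or quarters the bytes fetched per token, which is why quantization translates directly into better use of the off-chip memory bandwidth that dominates LLM decoding on embedded platforms.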