In this paper, the authors develop a procedure for Int8 matrix multiplication for the feed-forward and attention projection layers in transformers. This lets large language models be used with significantly less GPU memory while retaining full-precision performance. The procedure, LLM.int8(), has two parts: vector-wise quantization with separate normalization constants for each inner product, which handles the vast majority of features, and a mixed-precision decomposition scheme that keeps the emergent outlier feature dimensions in 16-bit. With LLM.int8(), the authors show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This development can make large language models more accessible, for example on consumer GPUs, and the authors have open-sourced their software.
https://arxiv.org/abs/2208.07339
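
To make the two ideas concrete, here is a minimal NumPy sketch of vector-wise Int8 quantization combined with a mixed-precision decomposition for outlier features. The function names, the outlier threshold of 6.0, and the overall structure are illustrative assumptions for this summary, not the authors' reference implementation (their open-sourced software is the place to look for that).

```python
# Sketch only: vector-wise Int8 quantization plus a fp16 path for outlier
# feature dimensions. Names and the 6.0 threshold are assumptions, not the
# authors' actual API.
import numpy as np

def quantize_rowwise(a: np.ndarray):
    """Symmetric absmax quantization of each row of `a` to int8."""
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0  # one constant per row
    scale = np.where(scale == 0, 1.0, scale)               # avoid division by zero
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

def llm_int8_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """Approximate x @ w: Int8 for "regular" features, 16-bit for outliers.

    x: (tokens, features) activations, w: (features, out) weights.
    Feature dimensions of x whose magnitude exceeds `threshold` anywhere
    are treated as outliers and multiplied in higher precision.
    """
    outlier_cols = np.abs(x).max(axis=0) > threshold

    # Vector-wise Int8 path: per-row scales for x, per-output-column scales for w.
    x_q, x_scale = quantize_rowwise(x[:, ~outlier_cols])
    w_q, w_scale = quantize_rowwise(w[~outlier_cols, :].T)
    acc = x_q.astype(np.int32) @ w_q.T.astype(np.int32)    # int32 accumulation
    y_int8 = acc * (x_scale * w_scale.T)                    # dequantize with outer product of scales

    # Mixed-precision path: the few outlier dimensions stay in 16-bit.
    y_fp16 = x[:, outlier_cols].astype(np.float16) @ w[outlier_cols, :].astype(np.float16)

    return y_int8 + y_fp16.astype(np.float64)

# Toy usage: a random matmul with two artificially large "outlier" features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x[:, :2] *= 20.0
w = rng.normal(size=(64, 32))
print(np.abs(llm_int8_matmul(x, w) - x @ w).max())  # quantization error vs. full precision
```

The split mirrors the paper's argument: per-vector scales keep quantization error low for the bulk of features, while routing the rare high-magnitude dimensions through a 16-bit multiplication avoids the accuracy collapse they would otherwise cause.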