QLoRA: Efficient Finetuning of Quantized LLMs

QLoRA is a finetuning approach that makes it possible to finetune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. It works by backpropagating gradients through a frozen, 4-bit-quantized base model into Low-Rank Adapters (LoRA). Using QLoRA, the researchers finetuned more than 1,000 models and achieved state-of-the-art results on several chatbot benchmarks. The approach introduces several memory-saving innovations: a new data type called 4-bit NormalFloat (NF4), designed for normally distributed weights; double quantization, which quantizes the quantization constants themselves to reduce the average memory footprint; and paged optimizers to manage memory spikes. Finally, the researchers demonstrate that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation of chatbot performance.
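The intuition behind NF4 is that pretrained weights are roughly normally distributed, so quantization levels should be placed at quantiles of a standard normal rather than spaced uniformly. The sketch below illustrates that idea using only the standard library; it is a simplified construction, not the paper's exact NF4 code book (which, for example, handles zero and the tail quantiles asymmetrically so that zero is represented exactly).

```python
from statistics import NormalDist

def normal_float_levels(bits: int = 4) -> list[float]:
    """Sketch of quantile-based quantization levels for N(0, 1) weights.

    Places one level at the midpoint of each of 2**bits equal-probability
    bins of the standard normal, then rescales so the levels span [-1, 1].
    Simplified illustration only, not the exact NF4 construction.
    """
    n = 2 ** bits
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]

levels = normal_float_levels()
# Levels are densest near zero, where normally distributed weights concentrate.
print(levels)
```

The double-quantization saving is simple arithmetic: storing one 32-bit quantization constant per block of 64 weights costs 32/64 = 0.5 bits per parameter; quantizing those constants to 8 bits, with a second level of 32-bit constants per 256 blocks, costs 8/64 + 32/(64 * 256) ≈ 0.127 bits per parameter.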

https://arxiv.org/abs/2305.14314