This repository contains the code for QUIK, a post-training method that quantizes the majority of a model's weights and activations to 4 bits. The method is described in the paper “QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models”. The repository provides instructions for installing the dependencies and compiling the code, along with examples and benchmarks for QUIK. To use a quantized model, the weights must be quantized with the GPTQ algorithm and the Linear layers must be replaced with QUIK Linear layers; the code gives more detail on both steps. The full paper and its citation are available on arXiv.
https://github.com/IST-DASLab/QUIK
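
To make the idea of 4-bit weight quantization concrete, the following is a minimal sketch in plain Python. It is not the repository's code and it is not GPTQ: it uses simple symmetric round-to-nearest quantization with one scale per weight row, which is an assumption chosen only for illustration. The function names (`quantize_row_int4`, `dequantize_row`, `quantize_matrix_int4`) are hypothetical and do not correspond to anything in QUIK.

```python
from typing import List, Tuple


def quantize_row_int4(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one weight row to signed 4-bit integers in [-8, 7].

    Illustrative round-to-nearest scheme (NOT GPTQ, NOT QUIK's kernels):
    the largest magnitude in the row is mapped to 7.
    """
    max_abs = max((abs(w) for w in row), default=0.0)
    # An all-zero (or empty) row gets a dummy scale to avoid dividing by zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer, then clamp to the signed 4-bit range.
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_matrix_int4(weights: List[List[float]]) -> List[Tuple[List[int], float]]:
    """Quantize each row (output channel) independently, one scale per row."""
    return [quantize_row_int4(row) for row in weights]


if __name__ == "__main__":
    # Hypothetical toy weights, made up for this example.
    weights = [[0.7, -0.35, 0.1, 0.0], [1.4, 0.2, -1.4, 0.7]]
    for row, (q, scale) in zip(weights, quantize_matrix_int4(weights)):
        approx = dequantize_row(q, scale)
        worst = max(abs(a - b) for a, b in zip(approx, row))
        print("int4:", q, "scale:", scale, "max abs error:", worst)
```

With this scheme each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`). GPTQ, which the repository relies on, goes further by choosing the rounding so as to reduce the layer's output error, so this sketch should be read only as the baseline that such methods improve on.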