Recent work on extreme low-bit quantization has been gaining attention in the machine learning community, in part because binary weights allow matrix multiplication without actual multiplications, improving compute efficiency. The blog post explores quantizing pre-trained models at extreme settings, down to binary weights, using HQQ+. Surprisingly, fine-tuning just a small fraction of the parameters (low-rank adapters) on top of an HQQ-quantized model significantly improves output quality, even at 1-bit, surpassing smaller full-precision models. The experiments suggest that extreme low-bit quantization can preserve output quality while sharply reducing memory and compute requirements, making larger models more accessible. This work may spark interest in further developing software and hardware that fully exploit the approach.
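For intuition, here is a minimal PyTorch sketch of the idea, not the blog's actual HQQ+ implementation: weights are reduced to their signs plus a per-row scale (a simplified binarization; HQQ itself uses a group-wise scale/zero-point formulation), so the weight matmul in principle needs only additions and subtractions, and a small trainable low-rank adapter is fine-tuned on top to recover quality. The class name and rank are hypothetical, chosen for illustration.

```python
import torch

class BinaryLinearWithAdapter(torch.nn.Module):
    """Hypothetical sketch: 1-bit (sign-only) weights with a per-row scale,
    corrected by a small trainable low-rank adapter, in the spirit of HQQ+."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        W = torch.randn(out_features, in_features)
        # 1-bit weights: keep only the sign; a per-row scale preserves magnitude.
        self.register_buffer("W_sign", torch.sign(W))
        self.register_buffer("scale", W.abs().mean(dim=1, keepdim=True))
        # Low-rank adapter: the only trainable parameters after quantization.
        self.A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With +/-1 weights this matmul reduces to additions/subtractions;
        # emulated here with a regular matmul for clarity.
        y = torch.nn.functional.linear(x, self.W_sign * self.scale)
        # Low-rank correction recovers quality lost to binarization.
        return y + x @ self.A.T @ self.B.T

layer = BinaryLinearWithAdapter(64, 32)
out = layer(torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 32])
```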
https://mobiusml.github.io/1bit_blog/