SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

The paper introduces SplitQuantV2, an algorithm that improves low-bit linear quantization of large language models (LLMs) without requiring high-end GPUs or any specific DNN framework. Because it is platform-agnostic and computationally lightweight, SplitQuantV2 can be run on a wide range of devices, including edge AI devices. The algorithm substantially improves the accuracy of INT4-quantized models, bringing them on par with the original floating-point models. Notably, preprocessing a 1B model and performing linear INT4 quantization took only 2 minutes and 6 seconds on an Apple M4 CPU, making SplitQuantV2 a practical option for deploying LLMs on devices with limited computational resources.

https://arxiv.org/abs/2503.07657
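
For readers unfamiliar with the term, "linear INT4 quantization" refers to affine quantization of weights onto a 4-bit integer grid. The sketch below is a minimal, generic NumPy illustration of that scheme under common assumptions (signed INT4 range, per-tensor scale and zero point); it is not SplitQuantV2's own code, and the function names are hypothetical:

```python
# Minimal sketch of linear (affine) INT4 quantization, the scheme the paper
# targets. Generic illustration only, not SplitQuantV2's implementation;
# SplitQuantV2 is a preprocessing step applied before quantization like this.
import numpy as np

def linear_int4_quantize(w: np.ndarray):
    """Map float weights to 4-bit integers in [-8, 7] with a scale and zero point."""
    qmin, qmax = -8, 7                       # signed INT4 range
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (qmax - qmin)  # float step per integer level
    zero_point = round(qmin - w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float weights from the INT4 representation."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s, z = linear_int4_quantize(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s, z)).max())
```

With only 16 integer levels, the reconstruction error is bounded by half the scale, so accuracy hinges on how tight the weight range is; this is the sensitivity that a preprocessing step like SplitQuantV2 aims to mitigate.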
