16-Bit to 1-Bit: Visual KV Cache Quantization for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) have been successful in various applications, but their computational overhead during deployment remains a challenge. Key-Value (KV) caching improves inference efficiency by trading memory for computation, but large KV caches reduce throughput and limit long-running execution on devices with constrained GPU memory. Existing methods shrink the cache by dropping tokens deemed unimportant, sacrificing potentially useful information. Instead, a new visual quantization strategy is proposed that preserves all visual tokens while significantly reducing memory usage. Using group-specific and quantile-based quantization, an extreme 1-bit quantization ratio is achieved. This plug-and-play method improves memory efficiency in MLLMs without architectural changes, maintaining computational efficiency and multimodal performance.
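The summary above only names the techniques, so here is a minimal, illustrative sketch of what quantile-based 1-bit KV quantization could look like: each group of cache values is split at its median (the 0.5 quantile), one bit records which side each value falls on, and two per-group 16-bit reconstruction levels are kept for dequantization. The grouping axis, quantile choice, and reconstruction levels here are assumptions for illustration, not the paper's exact algorithm.

```python
import torch

def quantize_kv_1bit(kv: torch.Tensor, group_dim: int = -1):
    """Illustrative quantile-based 1-bit quantization of a KV cache tensor.

    Values above the per-group median map to bit 1, the rest to bit 0.
    Two per-group reconstruction levels (the mean of each half) are stored
    alongside the 1-bit codes for dequantization.
    """
    median = kv.median(dim=group_dim, keepdim=True).values
    bits = kv > median  # 1-bit codes
    nan = torch.full_like(kv, float("nan"))
    # Per-group reconstruction levels: mean of each half of the split.
    hi = torch.where(bits, kv, nan).nanmean(dim=group_dim, keepdim=True)
    lo = torch.where(~bits, kv, nan).nanmean(dim=group_dim, keepdim=True)
    return bits, lo, hi

def dequantize_kv_1bit(bits: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor):
    """Reconstruct an approximate KV tensor from 1-bit codes and levels."""
    return torch.where(bits, hi, lo)

# Example: quantize a hypothetical key cache (batch, heads, seq, head_dim),
# grouping along the channel dimension.
keys = torch.randn(1, 8, 128, 64)
bits, lo, hi = quantize_kv_1bit(keys)
keys_approx = dequantize_kv_1bit(bits, lo, hi)
print((keys - keys_approx).abs().mean())  # mean reconstruction error
```

Storing one bit per value plus a pair of 16-bit levels per group is what makes the 16x compression over a 16-bit cache possible; the "group-specific" part of the paper's method presumably refers to choosing such groups to match structure in the KV cache.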

https://arxiv.org/abs/2502.14882