Bringing K/V context quantisation to Ollama

K/V context cache quantisation is a major addition to Ollama: by storing the attention key/value cache at reduced precision, it lets users run larger models, expand context sizes for more nuanced responses, and cut memory usage significantly. Q8_0 roughly halves the cache's memory footprint with minimal quality impact, while Q4_0 shrinks it further at a larger (but often acceptable) quality cost. Getting the feature merged took around five months, with challenges that included explaining the concept to the community and resolving repeated merge conflicts. The current version still has limitations, but the integration of K/V context cache quantisation into Ollama marks a significant step toward maximising efficiency and performance.
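
To make the savings concrete, here is a rough back-of-the-envelope sketch of how K/V cache size scales with the cache type. The model shape numbers (layers, K/V heads, head dimension, context length) are illustrative assumptions, not figures from the article; the bytes-per-value figures follow the GGML block formats (f16 = 2 bytes per value, q8_0 = 34 bytes per 32-value block, q4_0 = 18 bytes per 32-value block).

```python
# Rough K/V cache size estimate per quantisation level.
# Model shape values below are illustrative (Llama-3-8B-like
# assumptions); substitute the numbers for your own model.

N_LAYERS = 32      # transformer layers (assumed)
N_KV_HEADS = 8     # K/V heads after grouped-query attention (assumed)
HEAD_DIM = 128     # dimension per attention head (assumed)
CONTEXT = 32_768   # context window in tokens (assumed)

# Bytes per cached value, derived from the GGML block formats:
# f16 = 2 bytes; q8_0 = 34 bytes per 32 values; q4_0 = 18 bytes per 32.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(cache_type: str, context: int = CONTEXT) -> float:
    """Size of the K/V cache: two tensors (K and V) per layer, each
    holding context x n_kv_heads x head_dim values."""
    values = 2 * N_LAYERS * context * N_KV_HEADS * HEAD_DIM
    return values * BYTES_PER_VALUE[cache_type]

for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(cache_type) / 2**30
    saving = 1 - kv_cache_bytes(cache_type) / kv_cache_bytes("f16")
    print(f"{cache_type:>5}: {gib:5.2f} GiB  ({saving:.0%} smaller than f16)")
```

Under these assumptions the cache drops from about 4 GiB at f16 to roughly 2.1 GiB at q8_0 and 1.1 GiB at q4_0. In Ollama itself the cache type is selected with the OLLAMA_KV_CACHE_TYPE environment variable (f16, q8_0, or q4_0), and the feature requires flash attention to be enabled via OLLAMA_FLASH_ATTENTION=1.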

https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/