Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x

In this blog post, we introduce Consistency Large Language Models (CLLMs), a new family of parallel decoders that reduce inference latency by efficiently decoding an $n$-token sequence per inference step. Our research shows that pretrained LLMs can be taught to operate as efficient parallel decoders by mimicking the human cognitive process of forming complete sentences in mind before articulating them word by word. CLLMs deliver significant generation speedups, matching or surpassing other fast-inference techniques such as Medusa-2 and EAGLE, without additional memory cost. The approach builds on Jacobi decoding: starting from a guess for the next $n$ tokens, the model refines all $n$ positions in parallel at each iteration until the block converges to the fixed point that greedy autoregressive decoding would produce; the sequence of intermediate states is the Jacobi trajectory. CLLMs are trained to consistently map any point on a Jacobi trajectory directly to its fixed point, so far fewer iterations are needed at inference time. This not only speeds up decoding but also encourages the model to learn linguistic structures such as collocations.
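To make the mechanics concrete, below is a minimal sketch of greedy Jacobi decoding. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits`; the function name `jacobi_decode` and its parameters are illustrative, not the authors' released API.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prefix_ids, n=16, max_iters=64, pad_id=0):
    """Decode an n-token block in parallel via Jacobi fixed-point iteration.

    Starting from an arbitrary guess for the next n tokens, each iteration
    feeds [prefix + current guess] through the model and greedily re-predicts
    all n positions at once. Iteration stops when the guess stops changing,
    i.e. it has reached the same fixed point that greedy autoregressive
    decoding would produce.
    """
    # Arbitrary initial guess for the n future tokens.
    guess = torch.full((1, n), pad_id, dtype=torch.long,
                       device=prefix_ids.device)
    for _ in range(max_iters):
        inputs = torch.cat([prefix_ids, guess], dim=-1)
        logits = model(inputs).logits
        # Block position i is predicted from the logits at position
        # len(prefix) + i - 1 (the token immediately before it).
        start = prefix_ids.shape[-1] - 1
        new_guess = logits[:, start:start + n, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # converged to the fixed point
            break
        guess = new_guess
    return guess
```

Each iteration costs one parallel forward pass, so the speedup depends on the trajectory converging in far fewer than $n$ iterations; consistency training pushes the model toward jumping from any intermediate state to the fixed point in a single pass.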

https://hao-ai-lab.github.io/blogs/cllm/
