In this work, the authors introduce block diffusion language models, which interpolate between autoregressive and diffusion models and combine the strengths of both: flexible-length generation and improved inference efficiency via KV caching and parallel token sampling within each block. The proposed training recipe, including an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules, minimizes training variance. Block diffusion sets a new state of the art among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. Code, model weights, and project details are available on the project page.
https://arxiv.org/abs/2503.09573
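
To make the block-wise generation scheme concrete, here is a minimal sketch of the sampling loop: blocks are generated left to right (autoregressive over blocks), while tokens within a block are denoised in parallel over several steps. This is an illustration under stated assumptions, not the authors' implementation; `denoiser`, `MASK_ID`, the unmasking schedule, and the `kv_cache` placeholder are all hypothetical stand-ins.

```python
import torch

VOCAB_SIZE = 100
MASK_ID = 0      # assumed mask/noise token id (hypothetical)
BLOCK_SIZE = 4
NUM_STEPS = 8    # denoising steps per block (hypothetical schedule)


def denoiser(tokens: torch.Tensor, kv_cache: list[torch.Tensor]) -> torch.Tensor:
    """Stand-in for the learned denoising network: returns per-token logits.

    A real model would attend to the cached keys/values of previously
    generated blocks; here we return random logits purely for illustration.
    """
    return torch.randn(tokens.shape[0], VOCAB_SIZE)


def sample_block_diffusion(num_blocks: int) -> torch.Tensor:
    generated = []                       # finished blocks, in order
    kv_cache: list[torch.Tensor] = []    # stands in for cached KV of prior blocks

    for _ in range(num_blocks):
        # Start the new block fully masked (noised).
        block = torch.full((BLOCK_SIZE,), MASK_ID, dtype=torch.long)

        for step in range(NUM_STEPS):
            # Predict all positions of the block in parallel,
            # conditioned on previously generated blocks via the cache.
            logits = denoiser(block, kv_cache)
            proposal = torch.distributions.Categorical(logits=logits).sample()

            # Unmask a growing fraction of positions each step (toy schedule).
            keep = torch.rand(BLOCK_SIZE) < (step + 1) / NUM_STEPS
            block = torch.where((block == MASK_ID) & keep, proposal, block)

        generated.append(block)
        # A real model would append this block's keys/values to the cache here.
        kv_cache.append(block.float())

    return torch.cat(generated)


if __name__ == "__main__":
    print(sample_block_diffusion(num_blocks=3))
```

The key design point the sketch tries to convey is the hybrid dependency structure: conditioning across blocks is sequential (so past blocks can be KV-cached as in an autoregressive transformer), while sampling inside a block is parallel diffusion-style denoising.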