Accelerated AI Inference via Dynamic Execution Methods

This paper focuses on Dynamic Execution techniques, which adapt the computation flow to each input so that simpler problems are identified and solved with fewer resources, much as human cognition does. The methods discussed include early exit from deep networks, speculative sampling for language models, and adaptive steps for diffusion models; the paper reports significant improvements in latency and throughput without sacrificing quality. The growing demand for compute resources from data centers to the edge, particularly for Generative AI, is addressed through optimizations such as more efficient sampling methods and prediction of optimal stopping points. Integrations into Intel performance libraries and Hugging Face Optimum aim to simplify usage and increase adoption of these techniques.
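
To make the early-exit idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes each layer has an attached exit head that produces class logits, and that inference stops once the top softmax probability reaches a confidence threshold. The names `early_exit_predict`, `TOY_LAYERS`, `TOY_HEADS`, the toy layers themselves, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: Sequence[float],
    layers: Sequence[Layer],
    exit_heads: Sequence[Layer],
    threshold: float = 0.9,  # assumed confidence cutoff, not from the paper
) -> Tuple[int, int]:
    """Run layers in order and stop once an exit head is confident enough.

    Returns (predicted_class, layers_used). If no head reaches the
    threshold, the prediction of the final head is returned.
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")

    hidden = list(x)
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)
        probs = softmax(head(hidden))
        if max(probs) >= threshold:
            break  # easy input: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy model (hypothetical): each layer doubles the hidden state and each
# exit head reads the hidden state directly as two-class logits.
TOY_LAYERS: List[Layer] = [lambda h: [2.0 * v for v in h] for _ in range(3)]
TOY_HEADS: List[Layer] = [lambda h: list(h) for _ in range(3)]

if __name__ == "__main__":
    print(early_exit_predict([4.0, 0.0], TOY_LAYERS, TOY_HEADS))  # (0, 1)
    print(early_exit_predict([0.0, 0.3], TOY_LAYERS, TOY_HEADS))  # (1, 3)
```

In this toy setup the well-separated input exits after one layer, while the ambiguous input needs all three, which is the resource saving that input-dependent execution aims for. A real deployment would use trained exit classifiers and would tune the threshold against a quality target.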

https://arxiv.org/abs/2411.00853