Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

Frontier AI has reached new speeds: Llama 3.1 405B on Cerebras Inference runs at a record-breaking 969 output tokens/s, roughly 12x faster than the best GPU result. The model outpaces closed competitors like GPT-4o and Claude 3.5 Sonnet, delivering the highest throughput at 128K context length. With a time to first token of only 240ms, Cerebras Inference also offers the lowest latency, a key factor for responsive voice and video AI applications. Open-source Llama 3.1 405B is now more than 10x faster than closed frontier models, setting a new standard for instant AI.
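To see what these figures mean for end-to-end response time, here is a minimal sketch (not from the original post) that combines the reported 240ms time to first token with the 969 tokens/s decode rate; the function name and the 1,000-token example are illustrative assumptions:

```python
def response_latency(num_tokens: int,
                     ttft_s: float = 0.240,
                     tokens_per_s: float = 969.0) -> float:
    """Estimate end-to-end latency as time-to-first-token plus decode time.

    Simplified model: assumes a constant decode rate and ignores
    network overhead. Defaults use the figures reported by Cerebras.
    """
    return ttft_s + num_tokens / tokens_per_s

# A hypothetical 1,000-token completion:
print(f"{response_latency(1000):.2f} s")  # ~1.27 s end to end
```

Under this simple model, even a long completion finishes in well under two seconds, which is why the post highlights latency-sensitive use cases like voice and video.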

https://cerebras.ai/blog/llama-405b-inference