Sequoia is a speculative decoding framework for serving large LLMs with low latency on consumer GPUs without approximation: its outputs follow the target model's distribution exactly. Developed by researchers at Carnegie Mellon University, Together AI, Yandex, and Meta AI, it outperforms offloading-based baselines such as DeepSpeed ZeRO-Inference. Sequoia pairs a dynamic-programming algorithm that finds the optimal token-tree structure (making speculation scalable) with a sampling-without-replacement verification algorithm (making it robust across models and sampling temperatures), and serves LLMs such as Llama2-70B and Vicuna-33B. Its hardware-aware design also adapts to data-center GPUs like the A100 and L40, opening up new possibilities for AI-generated content applications on low-cost hardware.
https://infini-ai-lab.github.io/Sequoia-Page/
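As a rough, non-authoritative sketch of the two ingredients named above, here is a minimal Python example with toy distributions standing in for real draft and target models (all function names are ours, not Sequoia's API). The first function is the standard lossless accept/reject step that speculative decoding builds on; the second illustrates drawing draft tokens without replacement at one node of the token tree.

```python
import numpy as np

rng = np.random.default_rng(0)


def speculative_step(p_target, q_draft, rng):
    """One lossless accept/reject step of speculative decoding at a single
    position: the draft proposes a token and the target either accepts it
    or resamples from the residual, so the emitted token is distributed
    exactly according to the target model (no approximation)."""
    x = rng.choice(len(q_draft), p=q_draft)          # draft proposes token x
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return int(x)                                # accepted as-is
    residual = np.maximum(p_target - q_draft, 0.0)   # rejected: resample from
    residual /= residual.sum()                       # the normalized residual
    return int(rng.choice(len(residual), p=residual))


def draft_children_without_replacement(q_draft, k, rng):
    """Draw k *distinct* draft tokens at one tree node by zeroing out each
    drawn token and renormalizing, so the same candidate is never proposed
    twice at the same node (the sampling-without-replacement idea)."""
    q = q_draft.copy()
    children = []
    for _ in range(k):
        x = rng.choice(len(q), p=q / q.sum())
        children.append(int(x))
        q[x] = 0.0  # this token can no longer be proposed at this node
    return children


if __name__ == "__main__":
    vocab = 8
    p = rng.dirichlet(np.ones(vocab))  # toy target-model distribution
    q = rng.dirichlet(np.ones(vocab))  # toy draft-model distribution
    print("emitted token:", speculative_step(p, q, rng))
    print("draft children:", draft_children_without_replacement(q, 3, rng))
```

In the real system the draft and target distributions come from two LLMs, the tree shape is chosen by Sequoia's dynamic-programming optimizer rather than fixed, and verification walks the whole tree; this sketch only shows why acceptance preserves the target distribution and why without-replacement drafting avoids wasting budget on duplicate candidates.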