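The authors introduce Jamba, a base large language model built on a hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Jamba interleaves blocks of Transformer and Mamba layers and adds MoE to some of them, which increases model capacity while keeping the number of active parameters manageable. The architecture is configurable, and the implemented configuration fits on a single 80GB GPU. Jamba performs well on long-context evaluations, with strong results at context lengths of up to 256K tokens. The study examines key architectural decisions, such as how to combine Transformer and Mamba layers and how to mix experts, and describes properties of the architecture observed during training and evaluation. The Jamba weights are publicly available under a permissive license, and the authors plan to release checkpoints from ablation runs to encourage further research.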
https://arxiv.org/abs/2403.19887
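
To make the idea of interleaving layer types concrete, here is a minimal sketch of how a per-layer plan for a hybrid stack might be written down. It is not the authors' implementation: the names `LayerSpec` and `hybrid_layout`, the parameters `attn_every` and `moe_every`, the position of each layer type within the stack, and the numbers in the usage line are all hypothetical choices made for illustration, not values taken from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    """Plan for one layer of a hybrid stack (hypothetical structure)."""
    mixer: str  # "attention" or "mamba"
    mlp: str    # "moe" or "dense"


def hybrid_layout(n_layers: int, attn_every: int, moe_every: int) -> list[LayerSpec]:
    """Build an illustrative layer plan.

    Assumption: the last layer of every `attn_every` layers uses attention
    and the rest use Mamba; the last layer of every `moe_every` layers uses
    an MoE feed-forward block and the rest use a dense one.
    """
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        mlp = "moe" if i % moe_every == moe_every - 1 else "dense"
        layout.append(LayerSpec(mixer, mlp))
    return layout


# Illustrative values only: 8 layers, 1 attention layer per 8, MoE every 2nd layer.
layout = hybrid_layout(n_layers=8, attn_every=8, moe_every=2)
for index, spec in enumerate(layout):
    print(index, spec.mixer, spec.mlp)
```

Changing `attn_every` and `moe_every` changes the mix of layer types, which is the kind of trade-off between capacity, memory, and active parameters that the summary says the architecture is designed to expose.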