Mixtral 8x7B: A Sparse Mixture of Experts language model

The authors present Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the distinction that each layer is made up of 8 feedforward blocks, or experts. At each layer and for each token, a router network selects two experts to process the current state and combines their outputs. Although each token only interacts with two experts, the selected experts can vary at each timestep. This design gives each token access to 47B parameters while using only 13B active parameters during inference. Mixtral outperforms or matches models such as Llama 2 70B and GPT-3.5 across a range of benchmarks, particularly excelling in mathematics, code generation, and multilingual tasks. Additionally, the authors provide a fine-tuned model, Mixtral 8x7B – Instruct, which outperforms other chat models on human evaluation benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

https://arxiv.org/abs/2401.04088
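
To make the routing mechanism concrete, here is a minimal NumPy sketch of per-token top-2 gating. It is not the paper's implementation: the names `moe_layer` and `gate_w` and the toy experts are hypothetical, and in Mixtral the experts are SwiGLU feedforward blocks rather than simple matrices. The sketch only illustrates the idea that the router scores all experts, keeps the two highest-scoring ones, and returns a softmax-weighted sum of their outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, gate_w, experts, k=2):
    """Route one token's hidden state x through its top-k experts.

    x       : (d,) hidden state for a single token
    gate_w  : (d, n_experts) router weights (illustrative)
    experts : list of callables, each mapping (d,) -> (d,)
    Returns the gate-weighted sum of the selected experts' outputs.
    """
    logits = x @ gate_w              # one router logit per expert
    top_k = np.argsort(logits)[-k:]  # indices of the k largest logits
    weights = softmax(logits[top_k]) # softmax over the selected logits only
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Toy usage: 8 random tanh "experts" standing in for the real feedforward blocks.
d, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_layer(rng.normal(size=d), gate_w, experts)
```

Because only the two selected experts run per token, the compute cost tracks the 13B active parameters even though all 47B parameters must be held in memory.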
