DeepSeek-V3 Technical Report

DeepSeek-V3 is a Mixture-of-Experts language model with 671 billion total parameters, of which 37 billion are activated per token. It adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, and pioneers an auxiliary-loss-free strategy for load balancing (sketched below) together with a multi-token prediction training objective. Pre-trained on 14.8 trillion tokens, the full training run required only 2.788 million H800 GPU hours and was remarkably stable, with no irrecoverable loss spikes or rollbacks. DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Model checkpoints are publicly available.

https://arxiv.org/abs/2412.19437
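
The core of the auxiliary-loss-free load-balancing strategy is a per-expert bias that is added to routing scores only when selecting the top-k experts, then nudged up for underloaded experts and down for overloaded ones after each step. Below is a minimal PyTorch sketch of that idea under stated assumptions; the function names, shapes, and the exact update rule are illustrative, not DeepSeek's actual implementation.

```python
import torch

def route_tokens(affinity: torch.Tensor, bias: torch.Tensor, k: int):
    """Bias-adjusted top-k routing.

    affinity: [tokens, experts] token-to-expert routing scores.
    bias:     [experts] load-balancing bias (selection only).
    """
    # The bias influences WHICH experts are picked, but the gating
    # weights still come from the original, unbiased affinities.
    topk_idx = torch.topk(affinity + bias, k, dim=-1).indices
    gate = torch.gather(affinity, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)  # normalize over selected experts
    return topk_idx, gate

def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor,
                num_experts: int, gamma: float = 1e-3) -> torch.Tensor:
    """Raise the bias of underloaded experts, lower it for overloaded ones.

    gamma is the bias update speed (a hyperparameter assumed here).
    """
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    bias = bias + gamma * torch.sign(load.mean() - load)
    return bias
```

Because the bias never enters the gating weights or the loss, this balances expert load without the gradient interference that an auxiliary balancing loss would introduce.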
