DeepSeek Open-Sources Optimized Parallelism Strategies, 3 repos

DeepSeek Infra has shared profiling data from its training and inference framework, captured with the PyTorch Profiler, to help the community understand its communication-computation overlap strategies and low-level implementation details. The training profile shows a balanced MoE routing strategy. During inference prefilling, two micro-batches are used to overlap computation and communication, with the attention computation load balanced between them. The decoding profile (EP128, TP1, 4K prompt length) shows all-to-all communication that does not occupy GPU SMs; during decoding, the system waits for that communication to complete after computation finishes. More details can be found in DeepEP. The distinguishing aspect is the focus on balancing communication and computation for efficient training and inference.
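The two-micro-batch prefill overlap can be illustrated with a toy timing model. This is a minimal sketch under assumed, illustrative timings (the constants and function names below are hypothetical, not taken from DeepSeek's actual traces): while micro-batch A computes a layer, micro-batch B's all-to-all communication runs concurrently, so each pipeline step costs roughly the maximum of the two, rather than their sum.

```python
# Toy model of two-micro-batch computation/communication overlap.
# All timings are illustrative assumptions, not measured DeepSeek numbers.
COMPUTE_MS = 3.0   # assumed per-layer compute cost for one micro-batch
COMM_MS = 2.0      # assumed per-layer all-to-all cost for one micro-batch
LAYERS = 4

def serial_time(layers, compute, comm, micro_batches=2):
    """No overlap: every micro-batch waits for its own communication."""
    return layers * micro_batches * (compute + comm)

def overlapped_time(layers, compute, comm):
    """Two micro-batches: while one computes, the other communicates.

    Each interior step costs max(compute, comm); the pipeline fill adds
    one bare compute at the start and one bare comm at the end.
    """
    steps = 2 * layers  # micro-batches A and B each pass through every layer
    return compute + (steps - 1) * max(compute, comm) + comm

print(f"serial:     {serial_time(LAYERS, COMPUTE_MS, COMM_MS)} ms")
print(f"overlapped: {overlapped_time(LAYERS, COMPUTE_MS, COMM_MS)} ms")
```

With these assumed numbers, overlapping hides most of the communication cost (26 ms vs. 40 ms); the benefit grows as compute and communication times are more evenly balanced, which is why the profiles emphasize balancing the two.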

https://github.com/deepseek-ai/profile-data
