Infrastructure setup and open-source scripts to train 70B model from bare metal

We are grateful to Voltage Park, Dell, H5, and NVIDIA for their help in setting up our cluster. With guidance from Ozan, Melissa, Drew, Michael, and David at Voltage Park, we trained a 70B-parameter model from scratch that outperformed zero-shot GPT-4o on reasoning tasks. Our guide covers infrastructure setup end to end, from initial cluster bring-up to error recovery during training, and includes scripts and tools we built along the way, such as a patch for the NVIDIA Collective Communication Library (NCCL) and a burn-in workload for InfiniBand fabrics. We hit challenges provisioning individual machines, diagnosing faulty GPUs, and recovering from InfiniBand failures; working through them underscored how important it is to keep every machine fully healthy for training to succeed.
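To give a flavor of the per-machine health checks the post describes, here is a minimal sketch in Python of one such check: verifying that every GPU a node is supposed to have is actually visible to the driver. All names and the per-node GPU count are hypothetical illustrations, not taken from Imbue's actual scripts.

```python
# Hypothetical sketch of a per-machine GPU-count health check.
# Function names and EXPECTED_GPUS are illustrative assumptions.
import subprocess

EXPECTED_GPUS = 8  # assumption: 8 GPUs per node, typical for HGX-class servers


def parse_gpu_count(smi_output: str) -> int:
    """Count GPUs in the output of `nvidia-smi --query-gpu=name --format=csv,noheader`,
    which prints one GPU name per line."""
    return sum(1 for line in smi_output.splitlines() if line.strip())


def machine_is_healthy(smi_output: str, expected: int = EXPECTED_GPUS) -> bool:
    """A node passes only if every expected GPU is visible to the driver."""
    return parse_gpu_count(smi_output) == expected


def check_local_machine() -> bool:
    """Query nvidia-smi on this host; any error or timeout counts as unhealthy."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return False
    return machine_is_healthy(out)
```

A real fleet check would go much further (ECC error counters, NVLink status, InfiniBand link state, disk and firmware versions), but the pattern is the same: run the check on every node and only admit fully healthy machines into the training job.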

https://imbue.com/research/70b-infrastructure/
