Reka has trained strong multimodal language models and is sharing insights into developing infrastructure and training large models from scratch. The biggest surprise was the variance in hardware quality across compute providers, turning model training into a hardware lottery. Each cluster had its own challenges and failure modes, requiring unique hot-fixes. Training on GPUs instead of TPUs revealed a different experience due to hardware team competency. Developing internal workflows and using more popular tools like PyTorch helped mitigate challenges with external codebases. Scaling models systematically was difficult due to limited compute, leading to relying on “Yolo” runs. Despite the obstacles, Reka outperformed many others in less than a year.
https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness