Building Meta’s GenAI infrastructure

Meta is investing in its AI future with two clusters of 24,576 GPUs each, designed for high throughput and reliability. The clusters are built on open hardware and open source software, including the Grand Teton server platform, OpenRack, and PyTorch. As one step on its infrastructure roadmap, Meta aims to have 350,000 NVIDIA H100 GPUs in its fleet by the end of 2024. With a focus on building artificial general intelligence responsibly, the clusters support current and next-generation AI models, including Llama 3. Network (one cluster uses RoCE, the other NVIDIA Quantum2 InfiniBand) and storage solutions have been optimized to sustain large-scale training efficiently. Meta's commitment to open AI innovation includes supporting OCP designs and contributing to PyTorch, with the goal of building flexible systems that can keep pace with evolving AI models and research.
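
To make the PyTorch angle concrete, here is a minimal, hypothetical sketch of multi-GPU data-parallel training with torch.distributed, the style of workload such clusters run at far larger scale. It is not Meta's training code: the model, dataset, and hyperparameters are placeholders, and a real LLM job would shard the model (e.g., with FSDP) rather than use plain DDP.

```python
# Minimal sketch of distributed data-parallel training in PyTorch.
# Placeholder model and data; not Meta's actual training stack.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL is the usual backend for GPU collectives; on large clusters
    # it runs over fabrics such as RoCE or InfiniBand.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Tiny placeholder model standing in for a large transformer.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # synthetic batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py` on each node, this pattern scales from a single machine to thousands of GPUs, which is the flexibility the article's cluster design targets.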

https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
