GPT-4 details leaked?

According to the leak, GPT-4 is a massive model, more than 10 times the size of GPT-3, with approximately 1.8 trillion parameters across 120 layers. OpenAI reportedly implemented a mixture-of-experts (MoE) architecture with 16 experts of about 111 billion parameters each; the routing algorithm used to select experts is said to be fairly simple compared to more advanced alternatives. GPT-4 was allegedly trained on around 13 trillion tokens of text and code. Its inference cost is claimed to be three times that of the 175B-parameter Davinci model, attributed mainly to lower hardware utilization and the larger clusters required. OpenAI possibly employs speculative decoding at inference time, using a smaller model to draft tokens that the larger model then verifies. The training dataset reportedly includes CommonCrawl and RefinedWeb, plus speculated sources such as LibGen, Sci-Hub, and GitHub. The vision model is said to be fine-tuned separately on an additional 2 trillion tokens. Inference runs on clusters of 128 GPUs using tensor and pipeline parallelism. The author also speculates that a missing dataset could be a custom collection of college textbooks.
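The "simple routing" idea behind an MoE layer can be sketched as top-k gating: a learned linear gate scores every expert for each token, and only the k best experts process that token. This is an illustrative toy, not OpenAI's actual router; all names, dimensions, and values here are made up for the sketch.

```python
import numpy as np

def top_k_route(token_embedding, gate_weights, k=2):
    """Route one token to its top-k experts via a learned linear gate.

    token_embedding: (d_model,) vector for a single token
    gate_weights:    (d_model, n_experts) gating matrix (hypothetical values)
    Returns the indices of the k chosen experts and their mixing weights.
    """
    logits = token_embedding @ gate_weights        # one score per expert
    top = np.argsort(logits)[-k:][::-1]            # indices of the k highest scores
    # softmax over only the selected experts' logits -> convex mixing weights
    exp = np.exp(logits[top] - logits[top].max())
    weights = exp / exp.sum()
    return top, weights

rng = np.random.default_rng(0)
experts, weights = top_k_route(
    rng.standard_normal(8),            # toy 8-dim token embedding
    rng.standard_normal((8, 16)),      # toy gate for 16 experts, as in the leak
    k=2,
)
```

With 16 experts and k=2, each token activates only about an eighth of the expert parameters per layer, which is how an MoE model can have far more total parameters than it uses per forward pass.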
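Speculative decoding as described above can be sketched with a greedy toy: a cheap draft model proposes a few tokens ahead, the large target model verifies them, and the accepted prefix is kept. The key property is that the output matches what the target model alone would produce; the draft only affects speed. The models below are hypothetical stand-ins, not real LLMs.

```python
def speculative_decode(draft_next, target_next, prompt, n_steps, k=4):
    """Toy greedy speculative decoding.

    draft_next/target_next: callables mapping a token list to the next token.
    The draft proposes k tokens; the target accepts the matching prefix and
    then contributes one token of its own (covering the first mismatch).
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < n_steps:
        proposal, ctx = [], list(seq)
        for _ in range(k):                       # draft guesses k tokens ahead
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        accepted = 0
        for i, t in enumerate(proposal):         # target verifies each guess in order
            if target_next(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        seq.append(target_next(seq))             # target's own token after the prefix
    return seq[:len(prompt) + n_steps]

# Toy "models": the true next token is previous + 1 (mod 10).
target_next = lambda ctx: (ctx[-1] + 1) % 10
good_draft = target_next                 # draft that always agrees -> fast path
bad_draft = lambda ctx: 7                # draft that is usually wrong -> slow path
```

Both drafts yield the same output as the target model decoding alone; a good draft simply lets several tokens be accepted per verification pass instead of one.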
