SMOL-GPT is a minimal PyTorch implementation of GPT intended for educational use. It features efficient training, flash attention, RMSNorm, SwiGLU feed-forward layers, and modern sampling techniques (top-k, top-p, and min-p). The provided pre-trained model was trained on the TinyStories dataset using an 8-layer transformer with 8 attention heads, a 512-dimensional embedding, and a 4096-token vocabulary; training covered roughly 4 billion tokens over about 18.5 hours. Sample outputs showcase the model's creative storytelling. Contributions are welcome, though the project remains best suited for learning rather than production use.
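The named components fit together in the usual pre-norm GPT block. The sketch below is a minimal illustration, not the repo's actual code: the class names, the 4x MLP width, and the use of PyTorch's `scaled_dot_product_attention` (which dispatches to a flash-attention kernel when available) as the attention entry point are all assumptions.

```python
# Illustrative sketch only -- names and widths are assumptions, not smolGPT's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean-centering, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the RMS over the feature dim, then apply a learned gain.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(w1(x)) gated elementwise by w3(x), then w2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Block(nn.Module):
    """Pre-norm transformer block: attention + SwiGLU MLP, each with a residual."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, 4 * dim)

    def attention(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Fused kernel; uses flash attention on supported hardware.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

    def forward(self, x):
        x = x + self.attention(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```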
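At generation time, the three sampling strategies are typically applied as successive filters over the next-token distribution. A minimal sketch, assuming a 1-D logits tensor; the function name and defaults are illustrative, not the repo's settings:

```python
# Illustrative top-k / top-p / min-p filtering; defaults are assumptions.
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=None, top_p=None, min_p=None):
    """Sample one token id from a (vocab_size,) logits vector."""
    logits = logits / temperature

    if top_k is not None:
        # Keep only the k highest-scoring tokens.
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")

    probs = F.softmax(logits, dim=-1)

    if min_p is not None:
        # Drop tokens whose probability is below min_p times the max probability.
        probs[probs < min_p * probs.max()] = 0.0

    if top_p is not None:
        # Nucleus sampling: keep the smallest prefix of tokens (by probability)
        # whose cumulative mass reaches top_p.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Mask a token once the mass *before* it has already reached top_p.
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)

    probs = probs / probs.sum()  # renormalize after filtering
    return torch.multinomial(probs, num_samples=1).item()
```

Min-p scales its cutoff with the model's confidence, pruning aggressively when one token dominates and permissively when the distribution is flat, which is why it pairs well with higher temperatures.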
https://github.com/Om-Alve/smolGPT