The TinyLlama project is pretraining a 1.1B-parameter Llama model on 3 trillion tokens, with the goal of completing training within 90 days on 16 A100-40G GPUs. TinyLlama adopts the same architecture and tokenizer as Llama 2, so it is compatible with many open-source projects built on Llama, and at 1.1B parameters it is compact enough for applications with tight compute and memory budgets. The project publishes a release schedule for intermediate checkpoints to show its progress, and it highlights potential use cases such as assisting speculative decoding of larger models and enabling real-time dialogue generation in video games.

The training details cover the model's parameters, attention variant, sequence length, batch size, and hardware. The codebase supports multi-GPU and multi-node distributed training, flash attention, fused layernorm, fused SwiGLU, fused cross-entropy loss, and fused rotary positional embedding; together these optimizations yield high training throughput and a reduced memory footprint. The project also compares TinyLlama's training speed with other models and discusses its inference speed.
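Because TinyLlama keeps the Llama 2 architecture and tokenizer, its checkpoints can be loaded with standard Llama tooling. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name is an illustrative assumption and may differ from the names the project actually publishes.

```python
# Sketch: loading a TinyLlama checkpoint with transformers (checkpoint name assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed/illustrative checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("TinyLlama is a 1.1B parameter model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```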
https://github.com/jzhang38/TinyLlama
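The speculative-decoding use case can be sketched with transformers' assisted generation, where TinyLlama proposes draft tokens that a larger Llama-family model verifies. Both model names below are illustrative assumptions; the two models must share the Llama 2 tokenizer for this pairing to work.

```python
# Sketch: TinyLlama as a draft model for assisted (speculative) decoding (model names assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-hf"            # assumed larger target model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed TinyLlama checkpoint

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)

inputs = tokenizer("Speculative decoding lets a small model", return_tensors="pt")
# The draft model proposes tokens; the target model verifies them in a single forward pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```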