llama3 is implemented from scratch here, one tensor and matrix multiplication at a time. Tensors are loaded directly from the llama3 model file, so the weights must be downloaded before running the code. The walkthrough also covers model-specific details such as the tokenizer and the layout of the weights file. Token embeddings are normalized with RMS normalization, and the first transformer layer is built with attention implemented from scratch. Notably, Rotary Positional Embedding (RoPE) is used for the positional encoding of the query vectors, and the attention-head and query-vector operations are laid out step by step for a deep dive into the implementation.
https://github.com/naklecha/llama3-from-scratch
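Below is a minimal PyTorch sketch, in the spirit of the repo, of two steps the walkthrough covers: RMS-normalizing the token embeddings and applying RoPE to one head's query vectors. The function names (`rms_norm`, `apply_rope`), the random projection matrix, and the toy sequence length are illustrative assumptions, not the repository's exact code.

```python
import torch

# llama3-8B-like sizes, used here only to make the shapes concrete.
dim, head_dim, seq_len = 4096, 128, 8

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Scale each token vector by the reciprocal of its root mean square,
    # then apply the learned per-dimension gain.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(q: torch.Tensor, theta: float = 500000.0) -> torch.Tensor:
    # q: (seq_len, head_dim). Treat consecutive pairs of dimensions as
    # complex numbers and rotate each pair by a position-dependent angle.
    seq_len, head_dim = q.shape
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # (seq, head_dim/2)
    rotations = torch.polar(torch.ones_like(angles), angles)    # unit complex numbers
    q_complex = torch.view_as_complex(q.float().reshape(seq_len, -1, 2))
    return torch.view_as_real(q_complex * rotations).reshape(seq_len, head_dim)

# Toy usage: normalize embeddings, project to one head's queries, rotate.
embeddings = torch.randn(seq_len, dim)
norm_weight = torch.ones(dim)          # stands in for a loaded norm weight
wq_head0 = torch.randn(head_dim, dim)  # stands in for one head's slice of wq
q = rms_norm(embeddings, norm_weight) @ wq_head0.T  # (seq_len, head_dim)
q_rotated = apply_rope(q)
print(q_rotated.shape)  # torch.Size([8, 128])
```

In the actual repository these weights come from the downloaded llama3 checkpoint rather than `torch.randn`, and the rotation is applied per attention head to both queries and keys.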