Following Gall's Law (a complex system that works has evolved from a simple system that worked), this post builds up Rotary Positional Encoding (RoPE), the scheme used in transformer models such as Llama 3.2 to inject positional information into self-attention. Starting from sinusoidal embeddings based on sin and cos functions, it iterates step by step toward RoPE, guided by the properties an ideal encoding should have: uniqueness, linearity, adaptability, determinism, and extensibility. By contrasting absolute and relative position encoding, the post argues that what matters most for understanding language is the relationship between words rather than their absolute positions.
https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding
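As a rough illustration of the rotation at the heart of RoPE, here is a minimal NumPy sketch (the names `rope_rotate`, `head_dim`, and `base` are illustrative, not taken from the post): each consecutive pair of query/key features is rotated by an angle proportional to the token's position, so the attention dot product ends up depending only on the relative offset between tokens.

```python
# Minimal sketch of RoPE (Rotary Positional Encoding) with NumPy.
# Names (rope_rotate, head_dim, base) are illustrative assumptions.
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each consecutive feature pair of x by a position-dependent angle.

    x:         (seq_len, head_dim) query or key vectors, head_dim must be even.
    positions: (seq_len,) integer token positions.
    """
    seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "head_dim must be even to form rotation pairs"

    # One frequency per feature pair, following the sinusoidal schedule.
    freqs = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))  # (head_dim/2,)
    angles = positions[:, None] * freqs[None, :]                    # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]            # split features into (even, odd) pairs
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin     # standard 2-D rotation, applied pairwise
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Because the same rotation is applied to both queries and keys, the score
# q_m . k_n depends only on the relative offset (m - n), not on m or n alone.
q = rope_rotate(np.random.randn(8, 64), np.arange(8))
k = rope_rotate(np.random.randn(8, 64), np.arange(8))
scores = q @ k.T  # attention logits now carry relative positional information
```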