Our team has trained sparse autoencoders (SAEs) on Llama 3.3 70B and made the interpreted model available through an API, making it the most capable model with publicly accessible interpretability tools. By providing these tools, we hope to enable new research and product development. Exploring the feature space of Llama 3.3 70B, we have found a wide range of concepts in this latent space, including clusters of physics and programming features. Interestingly, we have observed that feature steering can affect factual recall, prompting further investigation. While we moderate potentially harmful features, we also recognize the value of unmoderated access for research purposes. This work lays the foundation for a Responsible Scaling Plan.
https://www.goodfire.ai/papers/mapping-latent-spaces-llama/
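To make the feature-steering idea concrete: an SAE decomposes a residual-stream activation into sparse feature activations, and steering clamps or scales a chosen feature before reconstructing the activation and patching it back into the forward pass. The snippet below is a minimal illustrative sketch in PyTorch, not the Goodfire API; the `SparseAutoencoder` class, the dictionary size, and the feature index are all hypothetical placeholders.

```python
# Minimal sketch of SAE-based feature steering (hypothetical, not the Goodfire API).
# Assumptions: an SAE trained on a residual-stream activation of width d_model,
# with an overcomplete dictionary of d_features latent features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations sparse and non-negative.
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

def steer(sae: SparseAutoencoder, activation: torch.Tensor,
          feature_idx: int, value: float) -> torch.Tensor:
    """Clamp one feature's activation, then reconstruct the activation.

    The steered reconstruction would replace the original activation
    during the model's forward pass; `feature_idx` and `value` are
    illustrative placeholders, not real feature labels.
    """
    features = sae.encode(activation)
    features[..., feature_idx] = value
    return sae.decode(features)

# Hypothetical usage: d_model=8192 matches Llama 3.3 70B's hidden size;
# the dictionary size of 65536 is an assumption for illustration.
sae = SparseAutoencoder(d_model=8192, d_features=65536)
x = torch.randn(1, 8192)  # stand-in for a residual-stream activation
x_steered = steer(sae, x, feature_idx=123, value=4.0)
```

In practice the steered reconstruction is substituted for the original activation at the layer the SAE was trained on, which is how clamping a single feature can change model behavior, including, as noted above, factual recall.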