This project aims to improve the interpretability of language models by using Sparse Autoencoders (SAEs) to extract clear, interpretable features from the latent space of Llama 3. Because the model stores many concepts in superposition, spread across overlapping combinations of neurons, an SAE trained on its activations disentangles these representations so that distinct concepts are recovered as individual, sparsely activating features. By capturing and carefully preprocessing activation data and training SAEs on it, the project yields concrete insights into model behavior and feature interpretability. It provides a complete pipeline for capturing activations, training SAEs, analyzing the learned features, and verifying the results experimentally.
https://github.com/PaulPauls/llama3_interpretability_sae
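To make the core idea concrete, the sketch below shows the basic shape of a sparse autoencoder trained on captured activations: an overcomplete linear encoder/decoder pair with a ReLU nonlinearity, optimized with a reconstruction loss plus an L1 sparsity penalty. This is a minimal illustration of the general technique under assumed sizes, names, and coefficients, not the repository's actual implementation.

```python
# Minimal sparse-autoencoder sketch for activation interpretability.
# NOTE: illustrative only -- layer sizes, the L1 coefficient, and all names
# are assumptions, not the settings used in the linked repository.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is typically several times d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; combined with the L1
        # penalty below, most features stay at exactly zero for a given input.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction term keeps the SAE faithful to the original activations;
    # the L1 term encourages sparsity, pushing toward one concept per feature.
    recon_loss = F.mse_loss(reconstruction, activations)
    sparsity_loss = features.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss


if __name__ == "__main__":
    d_model, d_hidden = 3072, 3072 * 8            # illustrative sizes
    sae = SparseAutoencoder(d_model, d_hidden)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

    # Stand-in for a batch of captured model activations, shape [batch, d_model].
    batch = torch.randn(64, d_model)
    reconstruction, features = sae(batch)
    loss = sae_loss(reconstruction, batch, features)
    loss.backward()
    optimizer.step()
```

In practice the batch would come from activations captured at a chosen layer of the language model; after training, each hidden unit of the SAE can be inspected by looking at the inputs that activate it most strongly.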