In this work, the author introduces logit prisms, a tool for understanding how transformer models make decisions. By decomposing the output logits into individual component contributions via linear transformations, the method makes it possible to analyze how different parts of the model, such as attention heads and MLP neurons, influence the final prediction. Applying these prisms to the gemma-2b model, the author examines how the model retrieves information and performs arithmetic. The analysis yields surprising insights: the model encodes information efficiently, and its predictions follow interpretable templates learned by MLP neurons. Through visualizations and projections, the study sheds light on the inner workings of transformer networks.
https://neuralblog.github.io/logit-prisms/
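To make the core idea concrete, here is a minimal sketch (not the author's code) of how a linear logit decomposition can work: because the residual stream is a sum of contributions from the embedding, attention heads, and MLP layers, each contribution can be projected through the unembedding to give an additive share of the final logits. The function name, the dictionary of per-component residuals, and the frozen-RMSNorm-scale approximation are illustrative assumptions, not details taken from the post.

```python
import torch

def logit_contributions(component_residuals, W_U, rms_eps=1e-6):
    """Decompose final logits into additive per-component contributions.

    component_residuals: dict mapping component name -> (d_model,) tensor,
        the residual-stream contribution of that component (embedding,
        each attention head, each MLP) at the final token position;
        assumed to sum to the full pre-norm residual (hypothetical setup).
    W_U: (d_model, vocab_size) unembedding matrix.
    """
    # The full residual is the sum of all component contributions.
    resid = torch.stack(list(component_residuals.values())).sum(dim=0)

    # Freeze the final RMSNorm scale at its value for the full residual,
    # so the normalization acts as a fixed linear map on each component.
    scale = torch.sqrt(resid.pow(2).mean() + rms_eps)

    # Project each scaled component through the unembedding to get its
    # additive contribution to the logits.
    return {name: (r / scale) @ W_U for name, r in component_residuals.items()}
```

Under this frozen-scale assumption, summing the returned per-component logit vectors recovers (approximately) the model's actual logits, which is what lets individual attention heads and MLP neurons be credited or blamed for a prediction.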