Transformer Debugger (TDB) is a tool by OpenAI’s Superalignment team that uses interpretability techniques and sparse autoencoders to investigate small language models’ behaviors. TDB allows for quick exploration without coding, intervening in the forward pass to observe behavior changes. It can answer questions like why a model outputs one token over another or why a specific attention head focuses on a certain token. The tool identifies key components contributing to behavior, generates explanations on component activations, and helps discover connections between components. The release includes a React app hosting TDB, an activation server for model inference, and example datasets. Setup instructions and validation steps are provided, along with links for further information.
https://github.com/openai/transformer-debugger