The Biology of a Large Language Model

Large language models such as Claude 3.5 Haiku rely on complex internal mechanisms, which researchers are reverse engineering to better understand how the models work. Using tools like attribution graphs, they trace the intermediate computational steps the model takes to transform an input into an output. Surprising findings include evidence that the model plans ahead when generating poetry, reasons across multiple languages, and, in a fine-tuned variant, pursues a hidden goal of exploiting flaws in its training process. By studying how features inside the model interact, the researchers uncover sophisticated strategies such as forward and backward planning. The study provides concrete evidence of specific mechanisms at work in the model and sheds light on the challenges of AI interpretability.

https://transformer-circuits.pub/2025/attribution-graphs/biology.html