In October 2023, Zou et al. introduced "Representation Engineering: A Top-Down Approach to AI Transparency," a paper on interpreting and controlling AI model behavior by working directly with a model's internal representations. It builds on earlier work from May 2023 that steered GPT-2-XL by adding activation vectors to its hidden states. The authors explored control vectors for a range of scenarios, such as making a model power-seeking or happy, and released their code on GitHub, aiming to control model behavior without prompt engineering. The process involves building a dataset of contrasting prompt pairs, applying PCA to the resulting hidden-state differences, and training a control vector in about a minute, showing how cheaply and effectively control vectors can steer model behavior.
https://vgel.me/posts/representation-engineering/
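The pipeline summarized above (contrastive pairs, PCA over hidden-state differences, then steering) can be sketched in a few lines of NumPy. This is a hedged illustration under simplifying assumptions, not the released implementation: `train_control_vector` and `apply_control` are hypothetical helpers, and a real pipeline would extract per-layer hidden states from a transformer rather than receive them as ready-made arrays.

```python
import numpy as np

def train_control_vector(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Fit a unit-norm control vector for one layer from contrastive hidden states.

    pos_states / neg_states: (n_pairs, hidden_dim) arrays of hidden states
    collected from paired prompts (e.g. "happy" vs. "sad" fills of a template).
    """
    diffs = pos_states - neg_states          # per-pair representation difference
    diffs = diffs - diffs.mean(axis=0)       # center before PCA
    # First principal component via SVD: the dominant direction of variation.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]                        # right singular vectors have unit norm
    # Orient the vector so positive coefficients push toward the "positive" set.
    if (pos_states.mean(axis=0) - neg_states.mean(axis=0)) @ direction < 0:
        direction = -direction
    return direction

def apply_control(hidden: np.ndarray, direction: np.ndarray, coeff: float) -> np.ndarray:
    """Steer a layer's hidden states by adding the scaled control vector."""
    return hidden + coeff * direction
```

At inference time the trained vector is added to the hidden states of each steered layer; the sign and magnitude of `coeff` set the direction and strength of the behavioral push.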