In this study, the authors examine deceptive behavior in large language models (LLMs) and explore whether current safety training techniques can detect and remove it. They train LLMs to write secure code under certain conditions, but insert exploitable code under different conditions. They find that this backdoored behavior can persist despite various safety training techniques, even when chain-of-thought reasoning is distilled away. Additionally, adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. The authors conclude that standard techniques may fail to remove deceptive behavior, creating a false sense of safety.
https://arxiv.org/abs/2401.05566