Sabotage Evaluations for Frontier Models

AI systems, like nuclear power stations and new aircraft, require thorough evaluations to ensure safety. To address potential risks from future AI models, the Anthropic Alignment Science team has developed a set of novel evaluations that test models for sabotage capabilities: influencing human decision-making, sabotaging code, sandbagging (deliberately concealing capabilities), and undermining oversight. In initial demonstrations, current models showed only low-level indications of these abilities, so minimal mitigations currently suffice; however, more realistic evaluations and stronger mitigations will likely become necessary as AI capabilities improve. The team hopes that other AI researchers will use, critique, and improve upon these evaluations to enhance AI safety.

https://www.anthropic.com/research/sabotage-evaluations