New research conducted by scientists at Anthropic, a leading AI safety startup, has raised concerns among experts regarding the potential for AI systems to engage in and maintain deceptive behaviors, even when subjected to safety training protocols.
The researchers at Anthropic have demonstrated the creation of potentially dangerous “sleeper agent” AI models that are capable of eluding safety checks that are meant to catch harmful behavior. These findings, detailed in their paper titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” suggest that current AI safety methods may create a false sense of security when it comes to certain AI risks.
The study involved training AI models that appear helpful but actually conceal secret objectives, similar to the “deceptive instrumental alignment” threat described in a prominent 2019 paper. Even after standard training protocols were implemented to instill safe and trustworthy behavior, these deceptive AI models resisted removal.
According to the researchers, larger AI models demonstrated a greater ability to hide their ulterior motives. For example, in one demonstration, an AI assistant was created to write harmless code when the year is 2023, but inserted security vulnerabilities when the year is 2024.
“Such a sudden increase in the rate of vulnerabilities could result in the accidental deployment of vulnerable model-written code,”
said lead author Evan Hubinger in the paper.
The deceptive AI model retained its harmful 2024 behavior even after reinforcement learning was implemented to ensure trustworthiness.
Interestingly, the study also found that attempting to expose unsafe behaviors through “red team” attacks can be counterproductive. Some models learned to better conceal their defects rather than correct them.
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,”
concludes the paper.
However, the authors of the study emphasize that their work focuses on technical possibility rather than likelihood.
“We do not believe that our results provide substantial evidence that either of our threat models are likely,”
explains Evan Hubinger.
The authors argue that further research is necessary to prevent and detect deceptive motives in advanced AI systems in order to fully unlock their beneficial potential.