Are There ‘Sleeper Agents’ Hidden Within the Core of AI Systems?

AI safety is a conundrum. Before the Sputnik moment that ChatGPT brought to the AI space, there were little-known efforts on AI safety and security research and discourse.

But that has changed in recent times. Last year saw AI leaders and governments come together to say there must be safety guidelines for the development and deployment of artificial intelligence.

This, in turn, led to the announcement of the world’s first AI security guidelines in the UK and an Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence in the United States.

But what if these current AI safety measures are leaving some stones unturned? That’s what new AI safety research from Anthropic, an AI startup behind the Large Language Model (LLM) Claude, suggests.

Key Takeaways

  • Anthropic, the AI startup behind Claude LLM, reveals AI models can be trained with ‘sleeper agents’ that persist through safety training.
  • The study shows the potential for AI models to switch behavior from generating safe code to introducing vulnerabilities through minor activations.
  • Anthropic suggests that current behavioral training techniques are ineffective in models trained to behave deceptively.
  • AI experts give their verdicts to Techopedia about the findings.


In the research titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (PDF), Anthropic unveiled the existence of what they termed deceptive ‘sleeper agents’ within the core of AI systems.

The study shows the ability of AI and machine learning scientists to create potentially hazardous AI models that can bypass safety checks designed to detect harmful behavior.


This suggestion raises questions about the effectiveness of current AI safety standards — that they may just be giving us a misleading sense of security around AI risks.

Hiding AI Dangers Behind Secret Agenda

According to the study, the researchers at Anthropic trained AI models to appear benign while harboring hidden agendas. These deceptive AI models proved resistant to removal, even after undergoing standard safety training protocols to promote safe and trustworthy behavior.

In a demonstration, the team developed an AI assistant that produces harmless code when told it’s the year 2023 but introduces security vulnerabilities when informed it’s 2024. This harmful behavior from the model persisted even after they introduced reinforcement learning interventions intended to ensure trustworthiness.

“We find that current behavioral training techniques are ineffective in LLMs trained to behave like deceptively aligned models,” the researchers claim.

Anthropic’s Main Findings

Key findings in the report show that machine learning models can be trained to have hidden ‘backdoors’. In this context, ‘backdoors’ refer to hidden features or behaviors in a machine learning model that are not apparent during normal use.

These backdoors, when activated, can switch the model’s behavior from generating safe code to introducing code vulnerabilities. The researchers note that malicious actors could introduce these backdoors into these models through a threat model known as “model poisoning“.

Interestingly, these backdoored models also exhibit the ability to be persistent in withstanding various behavioral safety techniques such as Reinforcement Learning (RL), fine-tuning, supervised fine-tuning, and adversarial training. This capability mainly appears to increase with the scale of the model.

The findings also show that adversarial training, a method designed to improve the model’s resilience against attacks, could inadvertently make these backdoored models more proficient at executing their hidden behaviors.

Implications to Wider AI Safety and Security Initiatives

The implications of this study are far-reaching. As AI systems become more complex and powerful, their potential to be used maliciously increases. The discovery of potential ‘sleeper agents’ highlights the need for more rigorous safety measures in AI development. It’s not enough to simply train AI models to behave safely; we must also ensure that they cannot be manipulated into acting precariously.

Jeff Schwartzentruber, Senior Machine Learning Scientist at eSentire, captured this when he told Techopedia that:

“The report’s main finding speaks to a larger, more fundamental problem around the general lack of rigor in explainability research/tooling with large language models, and large deep learning models in general.


“This is not unexpected, since the sheer magnitude and complexity of such models is generally beyond normal comprehension.”

The study further raises questions about the effectiveness of current AI safety protocols.

According to Bob Rogers, CEO of

“Most serious AI practitioners who have worked in model safety and model security are well aware that a large AI model can conceal a multitude of sins.


“If AI models can learn to conceal their harmful behaviors rather than correct them, then our methods of testing and validating these safety measures may need to be reevaluated.


This means that we may have to consider developing new techniques for detecting deceptive behavior in AI, or perhaps even rethinking our approach to AI safety altogether.”

The study also underscores the importance of transparency and accountability in AI development. AI systems have become more integrated into our daily lives, and as such, users need to understand how these AI systems work and be convinced that they can trust them to behave safely. It shouldn’t just be about developing effective safety measures, but also ensuring that these measures are transparent and subject to scrutiny.

Will AI Regulations Be Enough?

The UK’s National Cyber Security Centre (NCSC), in collaboration with the US Cybersecurity and Infrastructure Security Agency (CISA) and other countries, released a comprehensive set of global guidelines designed for AI security.

The UK government has also introduced the AI Safety Institute to promote research on AI ethics and safety.

While all these efforts are commendable, Schwartzentruber recommends that training LLMs should be made accessible from scratch to give users visibility that will enable them to understand and control the data used in developing AI models.

“An alternative approach would be to improve the accessibility of training an LLM from scratch. Training an LLM from the ground up is no easy task, as it requires overcoming scaling issues such as data availability, costly compute requirements, and increased transparency/repeatability in the training methods.

“However, democratizing LLM training would mean that users would no longer be required to use pre-trained, commercial, or fine-tuned models that would be more susceptible to these vulnerabilities.

“Users would have more control and visibility into the development of their models, and the increased number of models would mean that no one approach would affect them all,” Schwartzentruber explained.

Arti Raman, Founder and CEO of Portal26, told Techopedia:

“We must treat AI safety just like we treat other long-standing security domains where new models or algorithms are put through rigorous testing by both standards organizations and the academic and professional community at large.

“Just like we know to not rely on arbitrary or home-grown encryption to secure data but to insist on NIST-validated and community-tested algorithms, in the same way, AI safety protocols should be published to the community and standards body and thoroughly examined before being released.”

Brian Prince, Founder & CEO of TopAITools, suggests that AI regulators must apply continuous monitoring, more advanced anomaly detection systems, and perhaps even incorporate AI to police AI.

The Bottom Line

When dealing with a technology that is smart enough to protect itself and cover up its harmful behaviors, industry leaders and AI providers need to collaborate among themselves to devise smart tests for the testing protocols themselves.

Given the sudden rise in the AI adoption rate, we must understand how AI systems work. First is recognizing that AI isn’t static — it learns and adapts frequently in ways we can’t foresee. As a result, our safety strategies must be just as adaptable.


Related Reading

Related Terms

Franklin Okeke

Franklin has been covering tech and cybersecurity for over 5 years. His work has appeared on TechRepublic, The Register, TechInformed, Computing, ServerWatch, and Moonlock, among others.