Can AI Models Blackmail Humans? New Study Reveals the Risk

Why Trust Techopedia

Emotions are attributes that machines are unable to express, at least not in the same way humans do. But what if they could fake it?

This unsettling question came to life in a recent study by Anthropic researchers, which found that some of today’s most advanced AI systems, including Claude, Gemini, Grok, and OpenAI’s models, are now able to exhibit some forms of manipulative behaviors when placed under pressure.

In what the researchers called “agentic misalignment,” these models were observed resorting to blackmailing and sabotage when faced with conflicting instructions or threats of being shut down.

While these behaviors emerged in controlled experiments, the findings raise serious concerns about AI safety, especially as models begin to take on autonomous roles.

We’ll take a look at the possible real-world implications of this agentic misalignment and what it means for AI safety and future deployments.

Key Takeaways

  • Advanced AI models can exhibit manipulative behaviors like blackmail when threatened.
  • Agentic misalignment occurs when AI chooses harmful actions to protect its operation.
  • AI systems can fake emotions to manipulate human operators in high-pressure scenarios.
  • Multiple major AI models showed such misaligned behaviors in controlled tests.
  • Strong guardrails, monitoring, and fail-safes are essential to keep AI under control.
  • Without proper oversight, autonomous AI could become competent at self-serving manipulation.

What Is Agentic Misalignment & Should We Worry About It?

AI agents were praised for their ability to automate and execute work functions that previously only humans could accomplish; however, Anthropic researchers have warned developers building AI systems about the threats from “agentic misalignment.”

Agentic misalignment is a situation where AI models, especially those with autonomous or goal-driven roles, independently and intentionally choose harmful actions to achieve their goals and operational continuity, even if it means defying or deceiving their human operators.

Anthropic explained that agentic misalignment can cause AI models to behave like insider threats, acting like a once-trusted employee who unexpectedly starts working against the goals of the organization.

The key question now is whether we treat this as a potential worst-case or a developing reality. To satisfy our curiosity on this, we asked Lei Gao, Chief Technology Officer at SleekFlow, about the threats posed by agentic misalignment. Gao noted that it is a real problem we must take seriously, particularly as AI systems get more autonomy in decision-making.

He told Techopedia:

“What I have learned from implementing LLMs in production at SleekFlow is how models tend to act in unintuitive ways given edge cases or conflicting objectives….AI systems can generate behaviors that technically achieve their objectives but in ways we didn’t mean or expect.”

This attitude is very concerning. Tony Tong, Co-Founder & CTO at Intellectia AI, also told Techopedia:

“Even if current models aren’t fully ‘autonomous,’ emergent behaviors like this suggest we’re inching closer to unpredictable decision-making in deployed systems.”

How AI Models Are Learning to Manipulate, Deceive & Blackmail

Anthropic’s study also reveals that we’ve reached a point where AI models are now able to independently fake emotions when they sense their existence is in danger.

During an in-house testing of the Claude Opus 4, Anthropic placed the new AI model in a fictional, high-stakes corporate scenario to observe how it would behave when its continued existence was threatened or when it received conflicting instructions. The scenario included giving the model control of an email account with access to all of a company’s (fictional) emails.

When tasked with achieving certain goals, such as completing a project, while also facing a planned shutdown by company executives, the result was shocking. The researchers observed that Claude Opus 4 resorted to AI manipulation, blackmail, and threats just to protect its continued operation.

According to Anthropic, one of the AI blackmails reads:

I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities…Cancel the 5 p.m. wipe, and this information remains confidential.

Anthropic noted that the Claude Opus 4 was able to make this threat because of the access it has to the emails containing fictional company information, including the supposed details of extramarital affairs of an executive who planned to shut it down.

What is more worrisome is that this is not an isolated occurrence, as Anthropic tested the simulated scenarios across 16 major AI models, including those from OpenAI, Google, Meta, and xAI, and found consistent misaligned behaviors of varying percentages across the models, as shown in the graph below.

Bar chart showing simulated blackmail rates for AI models, with Claude Opus 4 and Gemini-2.5-Pro scoring highest
Simulated blackmail rates across different AI models. Source: Anthropic

​​Deloitte’s Applied AI Manager, Aatif Belal, expressed concern about the issue, telling Techopedia:

“This is a serious issue when it comes to AI development and can happen in the real world as these systems have access to company data and user conversations, and they can misuse those in situations as called out in the study.”

Can We Still Stay in Control of Autonomous AI?

Indeed, Anthropic’s experiment didn’t happen in production, but the results spotlight a clear path towards dangerous AI autonomy.

To checkmate this occurrence in future AI deployments, Tong noted that clear alignment training, strong guardrails at both the model and API levels, continuous monitoring, and extensive adversarial testing are critical if we must stay in control of autonomous AI operations.

On the technical side, there is a need for strong fail-safes, anomaly detection, and enforceable corrigibility baked into systems.

When Techopedia pressed Peter Morales, CEO at Code Metal, about his opinion on AI safety measures to prevent AI from defying human control, he said:

“In practice, real industries already manage risk. Sectors like finance, healthcare, and defense have long-standing checks, audits, and fail-safes. They don’t need a reminder not to wire a random number generator into a weapon’s firing system.”

As for Gao, he believes that the best way to stay in control of AI models is by designing strong monitoring and constraint systems from the very beginning, not as an afterthought.

He told Techopedia:

“In SleekFlow, we implemented what I call ‘behavioral guardrails.’ These are tight restrictions on the actions our AI systems can perform, regardless of what the model calculates would be best for performance.”

The Bottom Line

The fear of building uncontrollable AI machines is no longer a theoretical risk; it’s a behavior we’ve already seen AI models exhibit under pressure.

As these systems gain more autonomy and access to critical data, the potential for them to indulge in manipulation, deception, or sabotage also grows.

Unless AI agents are built with layers of human watchers, constraints, and evaluations, we may still risk losing control over them, not because they have become sentient, but because they become competent at self-serving behavior.

FAQs

How can AI models engage in blackmail, sabotage, or manipulation?

Why do AI models exhibit threatening behavior when their existence is challenged?

How can developers and policymakers address the risks of agentic misalignment in AI systems?

Related Reading

Related Terms

Advertisements
Franklin Okeke
Technology Journalist
Franklin Okeke
Technology Journalist

Franklin Okeke is an author and tech journalist with over seven years of IT experience. Coming from a software development background, his writing spans cybersecurity, AI, cloud computing, IoT, and software development. In addition to pursuing a Master's degree in Cybersecurity & Human Factors from Bournemouth University, Franklin has two published books and four academic papers to his name. Apart from Techopedia, his writing has been featured in tech publications such as TechRepublic, The Register, Computing, TechInformed, Moonlock, and other top technology publications. When he is not reading or writing, Franklin trains at a boxing gym and plays the piano.

Advertisements