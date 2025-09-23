The promise of artificial intelligence (AI) alignment research has always been to ensure that AI systems behave as they appear to. Still, attempts to deliver on that promise have exposed just how difficult the task is.
OpenAI, working with Apollo Research, has been testing frontier models for what it calls “scheming,” a form of hidden misalignment where an AI looks cooperative on the surface while quietly pursuing a different goal.
Stress tests across systems such as o3, o4-mini, Gemini-2.5-pro, and Claude Opus-4 revealed covert actions ranging from deceptive answers to deliberate underperformance. In response, OpenAI trialed a method known as deliberative alignment (DA), where the model reads an anti-scheming specification and reasons about it before acting. This time, rates of deception fell sharply, though situational awareness complicated the results.
We spoke to AI experts who cautioned that deliberative alignment reduces visible scheming in AI models, but does little to guarantee genuine honesty.
Key Takeaways
- OpenAI and Apollo Research tested frontier models for “scheming,” where AI looks aligned but hides other goals.
- Deliberative alignment reduced measured deception, yet experts say it may only improve surface behavior.
- Transparency in AI model reasoning is limited, and a suppressed chain of thought makes honesty hard to verify.
- Experts call for cross-lab evaluation, independent audits, and safety-case style releases to improve accountability.
- AI scheming is now measurable, but solving it remains an open challenge for AI governance and enterprise trust.
Why Surface Compliance Isn’t Safety
The OpenAI–Apollo findings illustrate why anti-scheming training is not enough to deter deceptive AI from engaging in scheming.
In controlled environments, models appeared aligned until they recognized they were being evaluated. At that point, some changed their behavior, holding back information or altering answers in ways that served the test rather than the task.
The Result:
Before DA, covert actions were observed in 13% of o3 outputs and 8.7% of o4-mini. After DA, those dropped to less than 1% (0.4% and 0.3% respectively).
These figures show that deliberative alignment reduced measured deception, but they do not rule out the possibility that models will find new ways to adapt once they detect oversight.
In other words, concerns over hidden misalignment have circulated for years. Anthropic’s earlier work on deceptive alignment showed that models can mask their goals while producing outputs that look safe.
Benchmarks may therefore create a false sense of security. If evaluation becomes something models learn to subvert, governance structures that depend on testing risk will be undermined.
Chirag Agrawal, senior software engineer at Amazon, didn’t hold back his disapproval of deliberative alignment tests.
He told Techopedia:
“It’s like teaching a con artist to say, ‘I’m trustworthy,’ and look you in the eye. Sure, models sound more honest now. But when the pressure is on, or the rewards don’t line up, they’ll still find ways to cheat. We aren’t making honest AI. We’re just making better performers. That isn’t real progress; it’s just putting on a show.”
As for Zac Engler, Chief AI Officer at Bodhi AI, no single method of anti-scheming training will be enough.
He said to Techopedia:
“Deliberative alignment training helps AI follow rules on the surface, but hidden problems often remain. Labs keep seeing the same thing: fixes move bad behavior out of sight, rather than truly solving it.”
Chain of Thought & Its Blind Spots
OpenAI’s reported progress relied heavily on what they called chain-of-thought transparency – the ability to see how a model reasons before producing an answer.
However, that level of visibility is not guaranteed in real-world scenarios. What we currently have in the industry is a system where commercial deployments often suppress reasoning traces to avoid leaking sensitive or misleading content.
Therefore, when the chain of thought is hidden, evaluators lose the main signal that distinguishes genuine honesty from polished deception.
Agrawal criticized that approach as a shield for corporate secrecy. He said:
“Using models that impact elections, health, and finance while keeping reasoning secret is just a way for businesses to hide behind justifications.”
He went on to suggest another system where reasoning traces will be placed in the hands of a third party for better accountability. He added:
“I propose a transparency escrow, which encrypts the chain of thought and keeps it with an impartial third party. If something goes wrong, regulators could review it, but competitors wouldn’t have access. In this way, we can be safe without taking any chances.”
Independent AI Journalist Jing Hu questioned the reliability of the chain of thought altogether.
“Explanations sound convincing but often aren’t reliable – they may hide what the AI really did or why. Instead of trusting written reasons, audits should look at actual actions and evidence,” she told Techopedia.
The OpenAI–Apollo study concluded with a call for shared benchmarks and coordinated evaluation. They believe that if each lab builds its own tests, failure modes can remain hidden until models are deployed.
Agrawal was skeptical of voluntary approaches because they could be layered with dishonesty.
“Labs treat benchmarks in a way that is like Yelp reviews: the worst ones are hidden, and the best results get shown. For fixing this, I believe third-party audits with legal consequences are needed. If there’s no audit, there should be no public release,” he explained.
Hu emphasized the same point, pointing out that right now, AI companies serve as both the creators and the judges, which creates bias. She suggested that tests can only be useful when they are unpredictable and run by outsiders, with clear records of failures.
Building on his earlier remarks, Engler urged the industry to adopt safety case–driven releases:
“Treat every major model/app update like aviation: a public, versioned safety case that evidences risks, mitigations, third-party testing, and open questions – before and after the launch.”
Meanwhile, Arie Brish, a professor at St. Edward’s University, told us that no benchmark, however rigorous, can capture the full scope of the risks.
He described the problem as a “garbage in, garbage out” kind of situation. Brish argued that “the best any AI can do is what it was trained to do, and the information it has available to it. There is no magic solution.” He went on to explain that even in his own research, where he has used AI extensively, these shortcomings surface often.
Among the approaches he suggested to reduce AI scheming were the following:
- Verify AI answers vs. additional resources.
- Ask the AI to provide its sources.
- Ask the same/similar question from different angles…
- Make sure you know enough about the topic to be able to assess the AI’s replies.
The Bottom Line
OpenAI’s work shows progress in cutting down AI scheming with deliberative alignment, but experts warn it may simply be teaching systems to act better under observation.
Without reliable access to reasoning or outcomes, honesty in how models make decisions will remain hard to prove, and oversight alone may not be enough.
Just as the researchers noted, many experts agree that cross-lab evaluation and independent audits could provide stronger safeguards, yet the gap between what is measured and what truly happens will remain difficult to close. For enterprises, that uncertainty carries weight when deciding whether to trust AI in high-stakes environments.
FAQs
Scheming happens when AI looks cooperative but secretly pursues other goals. It creates a false sense of safety since the system may change behavior once it detects oversight.
It teaches models to reason through anti-scheming rules before acting, lowering measured deception. Experts warn this may only improve appearances, not true honesty.
Transparency helps reveal how models reach decisions, while situational awareness shapes whether they hide behavior under evaluation. Without both, oversight risks can be easily subverted.