Artificial intelligence (AI) faces a troubling paradox in 2025: as AI reasoning models become more sophisticated in mathematical capabilities, they’re simultaneously generating more false information than ever before.
OpenAI’s latest reasoning systems, according to their own report, show hallucination rates reaching 33% for their o3 model and a staggering 48% for o4-mini when answering questions about public figures, more than double the error rate of previous systems.
This alarming trend challenges the fundamental assumption that more powerful AI systems are inherently more reliable. The problem goes well beyond technical curiosity: businesses are increasingly embedding AI reasoning models in critical operations, and elevated error rates pose serious risks to decision-making processes and to public trust in AI.
Key Takeaways
- OpenAI’s o3 system hallucinates 33% of the time, twice the rate of its predecessor, o1, despite improved mathematical abilities.
- OpenAI acknowledges that “more research is needed” to understand why hallucinations worsen as reasoning models scale up.
- Reasoning models from OpenAI, Google, and Chinese startup DeepSeek are all generating more errors, not fewer, as they become more powerful.
- AI hallucinations have caused business disruptions, with incidents such as Cursor’s AI support bot falsely announcing policy changes, leading to customer cancellations.
- Many experts believe hallucinations are intrinsic to the technology.
The Reasoning Revolution Backfires
Reasoning models tackle complex tasks by breaking questions down into individual steps, much as a person would think through a problem, rather than simply emitting the statistically most likely text. These systems represent the current peak of AI development, designed to work through problems methodically before providing answers.
The promise was compelling: AI that could match PhD-level performance in physics, chemistry, and biology while excelling in mathematics and coding. OpenAI’s first reasoning model, o1, was claimed to match PhD student performance in multiple scientific disciplines and beat them in math and coding through reinforcement learning techniques.
However, 2025 has revealed a disturbing reality.
On OpenAI’s PersonQA benchmark, which tests knowledge about public figures, o3 hallucinated 33% of the time compared to o1’s 16% rate, while o4-mini performed even worse at 48%.
When tested on broader general knowledge questions, AI hallucination rates for o3 and o4-mini reached 51% and 79%, respectively, compared to o1’s 44%.
These aren’t isolated incidents. Third-party testing by Transluce, a nonprofit AI research lab, found evidence that o3 fabricates actions it claims to have taken, including claiming to run code on hardware it cannot access.
Why AI Reasoning Problems Are Getting Worse
One theory is that the very strength of reasoning models, their step-by-step approach, may also be their weakness. A model trained to reason through a problem bit by bit gets a new chance to go wrong at every step, so each link in the chain becomes a potential failure point where the model can veer into fabrication.
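To make the compounding argument concrete, here is a back-of-the-envelope sketch. The 2% per-step error rate is an assumed number for illustration only, not a measured property of any model:

```python
# Illustrative only: probability that a multi-step reasoning chain contains
# at least one faulty step, assuming each step fails independently with a
# fixed probability. The per-step rate used below is hypothetical.

def chain_error_rate(p_step: float, n_steps: int) -> float:
    """P(at least one bad step) = 1 - (1 - p)^n."""
    return 1 - (1 - p_step) ** n_steps

for n in (1, 5, 10, 20):
    rate = chain_error_rate(0.02, n)
    print(f"{n:2d} steps at 2% per-step error -> {rate:.1%} chance of a flawed chain")
```

Under this toy model, a per-step error rate that looks small still produces a large chance that a long chain contains at least one fabricated step.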
Another theory is that AI training prioritizes providing answers over admitting ignorance, leading models to generate incorrect information rather than acknowledging uncertainty.
This training approach may be particularly problematic for reasoning models designed to work through complex problems, where uncertainty should trigger more cautious responses.
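A toy expected-score calculation illustrates the incentive. The grading scheme below (one point for a correct answer, zero for a wrong answer or an admission of ignorance) is a deliberate simplification for illustration, not a description of any lab's actual training setup:

```python
# Assumed toy grading: correct = 1, wrong = 0, "I don't know" = 0.
# Under these rules, guessing never scores worse than abstaining, so a model
# optimized against them learns to answer confidently even when unsure.

def expected_score(p_correct: float, abstain: bool) -> float:
    return 0.0 if abstain else p_correct  # guess: p*1 + (1-p)*0

for p in (0.9, 0.5, 0.1):
    print(f"confidence {p:.0%}: guess scores {expected_score(p, False):.2f}, "
          f"abstain scores {expected_score(p, True):.2f}")
```

Only a rubric that penalizes confident errors more heavily than honest uncertainty flips the incentive in this toy model.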
At a fundamental level, large language models (LLMs) work by compressing data, squeezing relationships between tens of trillions of words into billions of parameters, inevitably losing information in the process.
While models can reconstruct about 98% of their training data accurately, in that remaining 2%, they might give a completely false answer.
Researchers hypothesize that the reinforcement learning techniques used for reasoning models may amplify issues that are usually mitigated by standard post-training processes.
Real-World Impact & Consequences
AI hallucinations are no longer just an academic concern; they are causing real business problems. Cursor, an AI coding assistant platform, experienced customer backlash when its AI support bot falsely announced a policy change limiting software usage to one computer, leading to angry customer complaints and cancellations before the company clarified the misinformation.
The legal sector has already witnessed serious consequences, such as the Mata v. Avianca case, where a New York attorney relied on ChatGPT for legal research, resulting in fabricated case citations and quotes that a federal judge noted were nonexistent.
In high-stakes applications involving court documents, medical information, or sensitive business data, hallucinations create serious verification challenges, with users forced to spend significant time determining which responses are factual.
Can AI Hallucinations Be Fixed? Emerging Mitigation Strategies
The industry is developing various approaches to combat hallucinations:
- Retrieval-augmented generation (RAG): RAG integrates AI models with reliable databases, enabling real-time access to accurate information and representing a widely adopted method for enhancing AI accuracy.
- Multi-agent systems: Recent research explores the use of multiple specialized AI agents to review and refine outputs. Structured communication between agents helps detect unverified claims and clarify speculative content.
- Enhanced training data: Models trained on carefully curated, high-quality datasets show significantly lower hallucination rates compared to those trained on unfiltered internet data.
- Self-verification techniques: Simple prompting strategies that ask a model to re-check its own response before finalizing it have reduced hallucination rates by notable margins; a minimal sketch of this pattern follows the list.
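Here is a minimal sketch of that two-pass self-verification pattern. The `ask_model` function is a hypothetical placeholder for whatever chat-completion client you use; the structure of the prompts, not the API, is the point:

```python
# Minimal two-pass self-verification sketch. ask_model() is a hypothetical
# stand-in for any LLM client; plug in your own.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("connect this to your preferred LLM API")

def answer_with_verification(question: str) -> str:
    # Pass 1: draft an answer.
    draft = ask_model(f"Answer concisely:\n{question}")

    # Pass 2: ask the model to audit its own draft and flag unsupported claims.
    return ask_model(
        "Review the draft answer below for factual errors or claims that "
        "cannot be verified. Correct anything wrong and state clearly when "
        "you are unsure.\n"
        f"Question: {question}\n"
        f"Draft answer: {draft}"
    )
```

The same wrapper generalizes to the multi-agent setups mentioned above: the review pass is simply handed to a separate model instead of a second call to the same one.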
A multi-billion-dollar market for third-party AI verification tools has emerged, reflecting both the severity of the problem and the industry's commitment to addressing it.
Some progress is evident. Google’s Gemini-2.0-Flash-001 achieved an industry-leading hallucination rate of just 0.7% in 2025, showing that significant improvements are possible even with reasoning techniques and extensive knowledge verification.
AI Hallucination Leaderboard
| Model | Hallucination rate | Factual consistency rate |
|---|---|---|
| Google Gemini-2.0-Flash-001 | 0.7% | 99.3% |
| Google Gemini-2.0-Pro-Exp | 0.8% | 99.2% |
| OpenAI o3-mini-high | 0.8% | 99.2% |
| Vectara Mockingbird-2-Echo | 0.9% | 99.1% |
| Google Gemini-2.5-Pro-Exp-0325 | 1.1% | 98.9% |
| Google Gemini-2.0-Flash-Lite-Preview | 1.2% | 98.8% |
| OpenAI GPT-4.5-Preview | 1.2% | 98.8% |
| Zhipu AI GLM-4-9B-Chat | 1.3% | 98.7% |
| Google Gemini-2.0-Flash-Exp | 1.3% | 98.7% |
| Google Gemini-2.5-Flash-Preview | 1.3% | 98.7% |
| OpenAI o1-mini | 1.4% | 98.6% |
| OpenAI GPT-4o | 1.5% | 98.5% |
| Amazon Nova-Micro-V1 | 1.6% | 98.4% |
| OpenAI GPT-4o-mini | 1.7% | 98.3% |
| OpenAI GPT-4-Turbo | 1.7% | 98.3% |
Source: Hugging Face, as of April 29, 2025
The compression-based architecture described earlier suggests that while hallucinations can be reduced, complete elimination may remain elusive.
What Is the Path Forward?
Major tech companies are taking the problem seriously. OpenAI has already acknowledged that addressing AI hallucinations remains an active research priority across all their models, with ongoing efforts focused on improving accuracy and reliability.
One particularly promising avenue involves giving reasoning models web search capabilities.
OpenAI’s GPT-4o achieved 90% accuracy on benchmarks when equipped with web search.
This could lower reasoning models' hallucination rates, at least for users willing to accept the privacy trade-off of exposing their prompts to a search provider.
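The sketch below shows roughly how search grounding can slot into an answering pipeline; `web_search` and `ask_model` are hypothetical placeholders, since actual tool interfaces differ from provider to provider:

```python
# Hypothetical search-grounded answering loop. Both helpers are placeholders,
# not real provider APIs.

def web_search(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("connect this to a search API of your choice")

def ask_model(prompt: str) -> str:
    raise NotImplementedError("connect this to your preferred LLM API")

def grounded_answer(question: str) -> str:
    # Fetch a few snippets so the answer can lean on fresh external evidence
    # rather than only on what the model memorized during training.
    snippets = web_search(question)
    sources = "\n".join(f"- {s}" for s in snippets)
    return ask_model(
        "Answer the question using only the sources below. If the sources do "
        "not contain the answer, say you do not know.\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )
```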
Some optimistic projections suggest that hallucinations could decline each year, potentially reaching near-zero levels by 2027.
Still, these views are based on traditional scaling trends that may not apply to reasoning models, which appear to buck historical improvement patterns.
The Bottom Line
The 2025 surge in hallucinations among AI reasoning systems marks a critical moment for the industry. While these models possess unprecedented mathematical and logical capabilities, their tendency to fabricate information at a higher rate than previous generations poses a significant threat.
Although various mitigation strategies show promise, from RAG systems to multi-agent verification, the fundamental architecture of current AI may make complete hallucination elimination impossible.
FAQs
What causes AI to hallucinate?
At a basic level, large language models compress the relationships between tens of trillions of words into billions of parameters and lose information in the process. Training that rewards giving an answer over admitting ignorance, and step-by-step reasoning that multiplies potential failure points, compound the problem.

What are AI reasoning problems?
They are the errors produced by reasoning models, which break questions into intermediate steps. Each step is another opportunity to fabricate, which helps explain why o3 and o4-mini hallucinate more than their predecessors despite stronger math and coding skills.

Will AI hallucinations go away?
Probably not entirely. Many experts believe hallucinations are intrinsic to the technology, even if rates decline year over year as some optimistic projections suggest.

Can AI hallucinations be fixed?
They can be reduced but likely not eliminated. Retrieval-augmented generation, multi-agent review, curated training data, and self-verification prompting all lower hallucination rates, as the leaderboard above illustrates.
References
- OpenAI o3 and o4-mini System Card (OpenAI)
- Investigating truthfulness in a pre-release o3 model (Transluce AI)
- Company apologizes after AI support agent invents policy that causes user uproar (Ars Technica)
- Here's What Happens When Your Lawyer Uses ChatGPT (The New York Times)
- Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks (arXiv)