OpenAI ended 2024 by unveiling its next-generation artificial intelligence model, ChatGPT o3, calling it a new leap in AI.
Building on its predecessors, the model promises advancements in reasoning and problem-solving, and it has sparked debate about how close we are to artificial general intelligence (AGI).
The new model has certainly inspired confidence within OpenAI. As CEO Sam Altman said at the beginning of January 2025, “We are now confident we know how to build AGI.”
ChatGPT o3 remains in early-access testing, yet the performance benchmarks revealed so far are undeniably impressive.
Techopedia explores what o3 brings to the AI world and asks experts for their opinions on the new model.
Key Takeaways:
- OpenAI announces the development and early release of ChatGPT o3 and o3-mini.
- o3 demonstrates impressive performance in visual reasoning, coding, and mathematics tasks.
- The model also scored 87.5% (in a high-compute configuration) on the ARC-AGI test – a benchmark for testing general intelligence.
- o3 also ranks “within the top 200 human programmers,” according to its Codeforces rating.
- Experts suggest that o3 will change what AI can do — but critics suggest that ARC-AGI is not a measure of AGI.
Everything We Know About o3 So Far
Today, we shared evals for an early version of the next model in our o-model reasoning series: OpenAI o3 pic.twitter.com/e4dQWdLbAD
— OpenAI (@OpenAI) December 20, 2024
OpenAI’s frontier model o3 is the AI startup’s follow-up release to o1. It reportedly features chain-of-thought reasoning, which enables it to think before responding. In short, it breaks down its reasoning into multiple steps to solve complex problems.
The model also comes with an ‘adaptive thinking time’ API, which enables users to toggle between multiple reasoning modes (low, medium, and high) to determine the trade-off between speed and accuracy in a given scenario.
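As a rough sketch of how such a setting might surface in a request: the `reasoning_effort` field below mirrors the parameter OpenAI documents for its o-series reasoning models, but the model identifier and the exact field name for a released o3 are assumptions here, not confirmed details.

```python
# Illustrative sketch only: builds a chat-completion-style request payload
# with a selectable reasoning level. The "o3-mini" model name and the
# "reasoning_effort" field are assumptions modeled on OpenAI's o-series docs.
def build_request(prompt: str, effort: str = "medium") -> dict:
    if effort not in {"low", "medium", "high"}:
        raise ValueError("effort must be 'low', 'medium', or 'high'")
    return {
        "model": "o3-mini",          # hypothetical model identifier
        "reasoning_effort": effort,  # trades response speed for accuracy
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Prove that 17 is prime.", effort="high")
print(payload["reasoning_effort"])  # high
```

A higher effort setting would let the model spend more inference-time compute deliberating before it answers, at the cost of latency and price.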
One of o3’s main selling points so far has been its performance on the ARC-AGI benchmark, which tests models’ visual reasoning capabilities by requiring them to solve abstract puzzles.
o3, trained on the ARC-AGI-1 public training data set, achieved a score of 75.7% within the $10,000 compute limit. In addition, a high-compute version scored 87.5%. For reference, a study by NYU found the average human performance on ARC tasks ranged from 73.3% to 77.2%.
Thomas Randall, research lead at Info-Tech Research Group, told Techopedia:
“The increased deliberation and time spent fact-checking its output is to be commended. The o3 model family may have some delay while it processes information, but the reliability of the output is that much more improved.”
Despite o3’s positive performance, Randall highlights some limitations, particularly cost.
“This is to the point where OpenAI have claimed that the o3 models can meet the ‘conventional understanding’ of the AGI benchmark. However, the cost of doing so is not currently economical – the high compute setting cost thousands of dollars per task.”
That cost may be off-putting when ChatGPT o3’s release date arrives, but for advanced users or corporations, it may be worth the price.
What is o3 Good At?
Based on the available information, o3 excels at mathematics and coding tasks. In the past, many commentators have criticized ChatGPT for struggling with math.
However, the use of chain-of-thought reasoning and other techniques is helping to improve performance on these kinds of tasks and demonstrate an ability to understand abstract mathematical concepts.
In coding, o3 scored 2,727 on the Codeforces competitive programming rating system. This places o3 among the top 200 rated human competitors at the time of writing in January 2025. In comparison, o1 scored 1,891 on the same rating.
On mathematics benchmarks, o3 scored 96.7% on competition math (AIME 2024) and 87.7% on PhD-level science questions (GPQA Diamond). o1 scored 83.3% and 78.0%, respectively, on the same tasks.
o3’s performance on these mathematics and coding benchmarks suggests a notable improvement over the previous-generation model, with stronger problem-solving and coding competency across the board.
The newest o3 model from @OpenAI just hit a 2727 Codeforces rating, which puts it on par with the 183rd best human competitor worldwide.
You're already behind if you’re a software engineer and haven’t started using AI yet. pic.twitter.com/rTQ8Fmnn56
— NashQ 🦣 (@NashQueue) January 3, 2025
OpenAI o3 Key Performance Metrics
| Benchmark | o1 | o3 |
|---|---|---|
| ARC-AGI | 13.33% | 75.7% (87.5% in high-compute configuration) |
| Software engineering (SWE-bench Verified) | 48.9% accuracy | 71.7% accuracy |
| Competition code (Codeforces rating) | 1,891 | 2,727 |
| Competition math (AIME 2024) | 83.3% accuracy | 96.7% accuracy |
| PhD-level science questions (GPQA Diamond) | 78.0% accuracy | 87.7% accuracy |
Does o3 Demonstrate AGI?
Since the news of o3 dropped, there has been debate about whether the model represents a significant milestone on the road to AGI.
With coding scores placing o3 in the top 200 programmers, it’s easy to get swept up in the hype that this model could be coming for software engineers’ jobs.
However, critics like Gary Marcus have pointed out that o3 didn’t take the test blind. The model was trained on ARC-AGI’s public training set, so it is unlikely it would have achieved such high scores without that prior exposure. This means that we have to take the test results with a pinch of salt.
That being said, François Chollet, creator of ARC-AGI, dubbed o3’s score a “breakthrough” that “represents a significant leap forward in AI’s ability to adapt to novel tasks,” though he acknowledged that the ARC-untrained model has not yet been tested.
Chollet also clarified that ARC-AGI is not an acid test for AGI and stated that: “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”
Considering these factors, we still have a long way to go toward AGI, even though o3 appears to be an extremely promising entrant to the generative AI market.
First Impressions: How Experts are Reacting to o3 So Far
While o3 has not been officially released, that has not stopped tech and AI experts from anticipating what it means for the future.
Mike Knoop, co-founder of Zapier, is very optimistic about the o3 model, posting on X:
“o3 is really special and everyone will need to update their intuition about what AI can/cannot do. While these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI.”
Itamar Golan, CEO and co-founder of Prompt Security, released a post speculating that o3 had an IQ of 157 based on its Codeforces rating, which would make it “smarter” than 99.25% of people (though using IQ as a measure of the capabilities of LLMs is something we should be cautious about — humans and machines are not the same).
Other posters believe that o3 will adversely impact the employment prospects of entry-level human programmers.
One user, known as Lisan al Gaib, posted: “CS grads might honestly be cooked,” in response to o3’s high Codeforces rating putting it “in the 95.95 percentile of competitive programmers.”
What’s the Future of o3?
Given what we have seen so far, it seems that o3 is going to be the OpenAI model that sets out a new, more robust approach to reasoning, an area where LLMs like GPT-4 have fallen short in the past.
The use of chain-of-thought reasoning across o1 and o3 is laying the foundation for a new, more reliable generation of large language models (LLMs) that can “think” before they respond. Such approaches should cut down on the issue of hallucinations, but it remains unclear whether they can eliminate it entirely.
Considering o3’s performance on Codeforces, it appears that we’re going to see LLMs play a much greater role in software development, helping engineers to generate code or identify bugs and performance issues at a much greater pace.
Despite significant improvements, o3 appears to be more of a supplementary tool that augments programmers’ problem-solving capabilities than a replacement for them.
The Bottom Line
o3 demonstrates some impressive capabilities, but it does not look like AGI is just around the corner. In any case, OpenAI’s ability to cultivate hype around its releases shows why it is the number one AI startup in the world right now.
o3’s performance on mathematics and coding tasks shows that AI will heavily shape these areas in the future in a way that few companies can afford to ignore.
FAQs
What is ChatGPT o3?
When will ChatGPT o3 be available?
How to access ChatGPT o3?
What makes o3 different from o1?
Is ChatGPT o3 close to AGI?
How does o3 impact software development?
References
- OpenAI on X (X)
- H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark (arXiv)
- Rating – Codeforces (Codeforces)
- NashQ 🦣 on X (X)
- o3 “ARC AGI” postmortem megathread: why things got heated, what went wrong, and what it all means (Gary Marcus)
- OpenAI o3 Breakthrough High Score on ARC-AGI-Pub (ARC Prize)
- Mike Knoop on X: (X)
- Itamar Golan 🤓 on X (X)
- Lisan al Gaib on X (X)