OpenAI's O3 Model: A Deep Dive into Reasoning and the ARC AGI Breakthrough
Introduction to OpenAI's O3 Model
The artificial intelligence landscape has shifted significantly with the unveiling of OpenAI's O3 model. O3 represents a substantial leap in reasoning capabilities, building on the advances of the O1 model, and is slated for public release, starting with the O3-mini variant, by the end of January 2025. Many observers had framed 2024 as a year of AI consolidation, in which several labs reached parity with GPT-4 and turned to finding practical applications for these models. The emergence of O3 has brought a new wave of excitement, surpassing even the anticipation around O1's release and signaling rapid progress in reasoning models.
The O1 model, while groundbreaking, faced skepticism about its applicability outside domains like mathematics, programming, physics, and the hard sciences. Even so, these models are poised for widespread use across the AI research ecosystem, promising to accelerate progress. The open challenge is to explore their full potential and to develop public reinforcement learning training methodologies that extend their reach into other fields.
OpenAI's O3 has demonstrated that the industry is moving beyond the limitations of pre-training solely on internet text. It has achieved significant breakthroughs in reasoning evaluations, most notably:
- Becoming the first model to exceed the 85% threshold on the ARC AGI prize, albeit on the public evaluation set and at costs beyond the prize's compute limits.
- Demonstrating a remarkable jump from 2% to 25% on the new Frontier Math benchmark.
- Showing substantial improvements across various leading programming benchmarks, like SWE-Bench-Verified.
These achievements, coming only three months after O1's initial announcement, are poised to accelerate AI research significantly. Falling reasoning costs are also set to transform many software engineering roles as we know them today.
O3's Advances and the Importance of Consensus
OpenAI's O3 was revealed during the final day of their "12 Days of OpenAI" event, showcasing its ability to outperform prior state-of-the-art models like Gemini 1.5 Pro and Claude 3.5 Sonnet New across multiple domains.
A frequently overlooked detail in discussions about the O1 series models is the meaning of the shaded areas in their performance charts. The first blog post on O1 mentioned that solid bars represent the pass@1 accuracy, while the shaded areas indicate the performance achieved using majority voting (consensus) from 64 samples. This highlights the crucial role of multiple generation consensus for achieving the best performance with O1 models, emphasizing that optimal results cannot be obtained through a single output stream.
However, it's important to note that this does not necessarily require tree search or any intermediate representation. O1's professional mode, along with the ARC prize results, relies on this parallel generation to achieve the highest scores; a minimal sketch of the voting mechanism follows below.
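As a concrete illustration, here is a minimal consensus@N (majority voting) sketch in Python. The `sample` callable is a hypothetical stand-in for any LLM API call that returns a final answer string; nothing here reflects OpenAI's actual implementation.

```python
from collections import Counter
from typing import Callable

def consensus_at_n(sample: Callable[[str], str], prompt: str, n: int = 64) -> str:
    """Sample n completions independently and return the most common answer.

    `sample` maps a prompt to one completion's final answer, e.g. an LLM API
    call with temperature > 0 so repeated draws can disagree. Solid bars in
    the O1 charts correspond to n=1 (pass@1); the shaded areas to majority
    voting with n=64.
    """
    answers = [sample(prompt) for _ in range(n)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Toy usage with a deterministic stand-in "model"; a real call would sample
# a stochastic LLM so that the vote is meaningful.
print(consensus_at_n(lambda p: "42", "What is 6 * 7?", n=6))  # -> 42
```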
Frontier Math Benchmark and O3's Performance
The Frontier Math benchmark is a testament to the advanced capabilities of O3. Qualitative evaluations from Fields Medalists emphasize the challenge posed by this benchmark.
- Terence Tao, a 2006 Fields Medalist, noted, "These problems are extremely challenging... I think they would stump AI for at least the next few years."
- Timothy Gowers, a 1998 Fields Medalist, stated, "None of these problems are in my area of research, and they look completely unsolvable to me... They seem a level higher in difficulty than IMO (International Math Olympiad) problems."
Introduced on November 7, 2024, the Frontier Math benchmark was regarded as one of the few remaining open frontiers in AI capability evaluation. With its release, O3 became the only model to reach a double-digit score, leaping to 25%.
Programming Prowess and the ARC AGI Challenge
The second significant result for O3 is in the area of programming. OpenAI demonstrated a 71.7% score on SWE-Bench Verified, a current state-of-the-art achievement, along with impressive results on Codeforces, a programming competition website. Through consensus voting, O3 achieved a score of 2727, reaching the level of an International Grandmaster, placing it in the top 200 global human competitive programmers.
The O3-mini variant outperforms O1 while significantly reducing costs, and this may lead to the O3-mini becoming the more influential model for a broader user base. This breakthrough paved the way for the final achievement demonstrated in the O3 livestream: the successful solving of the ARC AGI challenge.
Addressing the ARC Evaluation
The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in his 2019 paper, "On the Measure of Intelligence," is an AI assessment method designed to evaluate intelligence more closely to the way humans do. The ARC assessment focuses on:
- A new formal definition of intelligence based on algorithmic information theory, describing intelligence as the efficiency of skill acquisition.
- Emphasizing the concepts of scope, generalization difficulty, prior knowledge, and experience.
- A set of design guidelines for a general AI benchmark.
- The creation of the ARC, built with a set of explicit prior knowledge that closely approximates human innate prior knowledge.
The ARC AGI prize, launched in June 2024 with a $1 million reward, aimed to incentivize the first solution meeting specific criteria to solve a set of private ARC tasks. The threshold for "solving" these tasks was 85% accuracy.
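For concreteness, each ARC task is a small set of input/output grid pairs, distributed as JSON in the public repository; the solver must infer the transformation from the "train" pairs and apply it to the "test" inputs. The miniature task below is an invented example in that format, with a deliberately trivial rule:

```python
import json

# A miniature ARC-style task: each grid is a 2D list of color indices (0-9).
# This toy task's hidden rule: mirror each row horizontally.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]},  # expected output: [[0, 0], [0, 3]]
    ],
}

def flip_horizontal(grid):
    """The rule a solver would need to induce from the train pairs."""
    return [row[::-1] for row in grid]

# Verify the induced rule explains every demonstration, then apply it.
for pair in task["train"]:
    assert flip_horizontal(pair["input"]) == pair["output"]
print(json.dumps(flip_horizontal(task["test"][0]["input"])))  # [[0, 0], [0, 3]]
```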
OpenAI's progress in this area is remarkable:
- GPT-2 (2019): 0%
- GPT-3 (2020): 0%
- GPT-4 (2023): 2%
- GPT-4o (2024): 5%
- O1-preview (2024): 21%
- O1 high (2024): 32%
- O1 Pro (2024): ~50%
- O3 tuned low (2024): 76%
- O3 tuned high (2024): 87%
The speed of this progress has surprised many, including those optimistic about Q* and other reasoning methods.
Chollet provided further details on the ARC Prize website:
- O3 was tested on two ARC-AGI datasets:
- A semi-private evaluation with 100 private tasks to assess overfitting.
- A public evaluation with 400 public tasks.
- Testing was conducted at two compute levels with different sample sizes: 6 (efficient mode) and 1024 (inefficient mode, 172 times more compute; note that 1024/6 ≈ 171, consistent with the reported ratio).
The high computation cost data for O3 is yet to be released, as pricing and functionality are still being determined.
O3's Architecture, Cost, and Training
The ARC AGI team collaborated directly with OpenAI to obtain model price estimates; final pricing for O3, once it officially launches in the API, may differ. Given the importance of the inference scaling law, the ARC-AGI team added an additional requirement for private evaluation submissions: in their blog, they recorded total costs and cost per task as a proxy for FLOPs, in lieu of directly measuring compute usage.
This aligns with a rule set for the public leaderboard in the ARC Prize announcement:
$10,000 USD is the upper limit on the run cost that can be spent to solve the 500 tasks (including 400 tasks in the public evaluation set and 100 tasks in a new semi-private evaluation set), including the cost of calling the commercial API.
O3's cost on the 500 tasks in the public or semi-public evaluation sets far exceeded this limit, with each query costing over $1,000.
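A quick back-of-the-envelope check, using only the numbers already stated (the $10,000 cap and the reported per-query cost), shows how far outside the budget these runs landed:

```python
# ARC Prize public leaderboard rule: at most $10,000 to solve all 500 tasks.
budget_per_task = 10_000 / 500
print(f"allowed budget per task: ${budget_per_task:.2f}")  # $20.00

# Reported O3 high-compute cost: over $1,000 per query (a lower bound).
print(f"overshoot: at least {1_000 / budget_per_task:.0f}x the per-task budget")  # 50x
```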
Chollet speculated that the core mechanism of O3 seems to involve natural language program search and execution within the token space. The model searches the space of possible chains of thought (CoTs) that describe the steps required to solve a task, possibly guided by an evaluator model.
It's important to emphasize that references to and assumptions about MCTS (Monte Carlo Tree Search) are misleading. Many smart people have been surprised that O1 and O3 can achieve these results through a single autoregressive generation stream from a language model. OpenAI employees have likewise emphasized that O3 is "just a model trained through reinforcement learning."
The cost analysis, based on ARC team data and OpenAI's O1 pricing ($60 per million output tokens), puts each high-compute query at roughly $5,000. Read naively as a single generation stream, this would imply the model generates on the order of 80 million tokens per response, which is unlikely without significant improvements in long-context models.
O3's Evaluation Configurations
The ARC prize blog notes that testing was conducted at two compute levels, with sample sizes of 6 (efficient mode) and 1024 (inefficient mode, 172 times more compute).
According to SemiAnalysis, O1 pro uses self-consistency methods or simple consensus@N checks to improve performance by selecting the most common answer from multiple parallel responses to the same query. In this context, the sample size N likely corresponds to the consensus@N value, indicating that O3's evaluation configurations are similar to what customers can use with O1 pro: 6x compute and an ultra-high 1024x compute per question.
This scale of inference is unlikely to be available to standard paying users for some time. Most users will likely encounter results from a single generation to consensus@10, depending on the specifications of the O1 model's "professional" version.
Assuming O1's price of $60 per million output tokens and dividing the per-query cost across 1024 streams suggests the model generates approximately 78,000 tokens per response. O3 also appears to benefit from a larger base model, so these numbers are plausible without invoking any additional "search" elements.
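The arithmetic behind the two readings of the same cost data, under the assumptions stated above (roughly $5,000 per high-compute query and O1's $60 per million output tokens):

```python
price_per_token = 60 / 1_000_000  # O1 pricing: $60 per million output tokens
cost_per_query = 5_000            # rough high-compute cost per ARC task

# Naive reading: one giant generation stream per query.
print(f"single stream: ~{cost_per_query / price_per_token / 1e6:.0f}M tokens")  # ~83M, implausible

# Consensus reading: 1024 parallel streams per query.
per_stream = cost_per_query / 1024 / price_per_token
print(f"per stream at N=1024: ~{per_stream / 1e3:.0f}k tokens")  # ~81k, plausible
# The ~78,000 figure above reflects a slightly lower per-query cost estimate;
# the order of magnitude is the same.
```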
The core narrative of progress in deep learning is to identify a promising direction and scale it up. The first wave came from internet-scale pretraining; OpenAI has now found a new direction by scaling reinforcement learning training and long-context inference. Given that O3 was released roughly three months after O1, the simplest explanation is that it uses the same architecture and training methods, just at a larger scale.
The Inference Scaling Law and the Future of O3
There is no evidence that O3 changes the inference architecture by adding tree search. The core of the inference scaling law is that sampling more (and longer) generations from the same single-stream model yields performance improvements.
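To see why more samples help at all, consider a deliberately simple binomial toy model (an assumption for illustration, not a claim about O3's internals): if each independent sample is correct with probability above 50% and wrong answers scatter, majority voting becomes more reliable as N grows.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct,
    when each sample is correct with probability p. A conservative toy model:
    real samples share a model, so their errors are correlated in practice."""
    k_min = n // 2 + 1  # strict majority (use odd n to avoid ties)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

for n in (1, 5, 63, 1023):  # odd analogues of the 6 and 1024 sample settings
    print(f"n={n:4d}: {majority_accuracy(0.6, n):.3f}")
# n=1: 0.600, n=5: 0.683, n=63: 0.947, n=1023: ~1.000
```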
A key question is whether O3's base model is Orion (OpenAI's internal code name, possibly GPT-5) or if the new base model only benefits from Orion during training. If the base model's size has increased by a factor of 2 to 5, then the data from the ARC prize's API pricing would be entirely consistent.
Despite the uncertainty, it’s clear that O1-level models are here to stay.
The Return of Reinforcement Learning
Earlier that day, Anthropic released a video about the company's founding, featuring several of its co-founders. One unexpected detail came from co-founder and CEO Dario Amodei:
"...The whole reason to scale up these models is that their intelligence is not yet sufficient for us to do RLHF (Reinforcement Learning from Human Feedback) on top of them."
As one of the originators of modern RLHF, Dario likely had an early intuition of the fine-tuning advances that were coming. This view of RLHF's potential is far more expansive than how most practitioners perceive it.
This year has seen reinforcement learning (RL) and related methods re-established as central to artificial intelligence. Writing this article has convinced me that I need to train a reasoning language model like this in 2025. For tech companies in 2024, standard pre-training became table stakes for the industry; it is foreseeable that O1-style models will be a default tool in the AI toolbox for a long time to come. I am very excited to embrace this new worldview and learn firsthand how these models are trained.