o3 makes ChatGPT an agent
OpenAI just released the full o3 model (and o4-mini!); here's how combining reasoning with tool use marks another move toward an agentic future
OpenAI just launched o3 and o4-mini, the newest models in its reasoning-centric o-series. They mark a clear shift toward tool-augmented agents, blending advanced reasoning with full access to ChatGPT’s existing toolkit.
Giving the o-series a toolkit
The o-series emphasizes reasoning depth over speed, designed to tackle complex prompts with more deliberate computation than standard GPT models.
From announcement to reality
The original o1 model demonstrated impressive reasoning capabilities, particularly in STEM fields. o3-mini, released earlier this year, improved performance while staying efficient. Now, with today's release, o3 and o4-mini take a leap forward by introducing comprehensive tool use capabilities.
These models reason about when and how to use all of ChatGPT's existing tools:
Web search for accessing up-to-date information
Python code generation and execution for data analysis
Visual reasoning for interpreting images, charts, and diagrams
Image generation capabilities
This represents an expected shift toward more agentic AI systems that can independently execute tasks on users' behalf.
Capabilities and performance
o3: Setting new standards for AI reasoning
o3 is OpenAI’s current reasoning flagship, outperforming prior models across various benchmarks:
Coding and software development
Mathematical problem-solving
Scientific reasoning
Visual perception and analysis
The model has achieved state-of-the-art performance on numerous benchmarks including:
Codeforces
SWE-bench (without specialized scaffolding)
MMMU (multimodal understanding)
In evaluations by external experts, o3 makes 20 percent fewer major errors than OpenAI o1 on difficult, real-world tasks, with particular excellence in programming, business/consulting, and creative ideation. Early testers highlighted its analytical rigor and ability to generate and critically evaluate novel hypotheses in fields like biology, math, and engineering.
o4-mini: Efficient reasoning at scale
OpenAI o4-mini is a smaller model optimized for fast, cost-efficient reasoning. It trades raw scale for speed and throughput, yet still beats o3-mini on most tasks:
On AIME 2025, o4-mini scores 99.5 percent when given access to a Python interpreter, effectively saturating this benchmark
It outperforms its predecessor, o3-mini, on non-STEM tasks and specialized domains like data science
Its efficiency supports significantly higher usage limits than o3, making it ideal for high-volume applications
Both models demonstrate improved instruction following and provide more useful, verifiable responses than their predecessors, thanks to enhanced intelligence and the inclusion of web sources. They also feel more natural and conversational, especially as they reference memory and past conversations.
The power of reinforcement learning
Training o3 involved scaling reinforcement learning to levels previously only used in pretraining. The results confirm what you'd expect: more compute, more time, better output. OpenAI has pushed an additional order of magnitude in both training compute and inference-time reasoning, and continues to see clear performance gains.
Key technical highlights include:
Training the models to use tools through reinforcement learning—teaching them not just how to use tools, but to reason about when to use them
Integration of visual reasoning directly into the chain of thought
Optimizing for cost-efficiency while maintaining high performance
Multimodal reasoning with images
Visual content is now part of the reasoning loop. Upload a diagram, sketch, or blurry photo and o3 can interpret and act on it with the same logic it applies to text. This unlocks a new class of problem-solving that blends visual and textual reasoning.
The models can interpret these visuals (even if blurry, reversed, or low quality) and use tools to manipulate images on the fly as part of their reasoning process. This capability has resulted in best-in-class accuracy on visual perception tasks.
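On the API side, images ride along in the same message format as text. Here is a minimal sketch of sending an image plus a question via the Chat Completions API; the model id assumes your account has access, and the image URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the model to reason over an image and text together.
response = client.chat.completions.create(
    model="o3",  # assumes your account has access to this model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},  # placeholder URL
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```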
An agentic toolkit for ChatGPT
OpenAI o3 and o4-mini have full access to tools within ChatGPT, as well as custom tools via function calling in the API. These models are trained to reason about problem-solving approaches, choosing when and how to use tools to produce detailed and thoughtful answers in the right output formats—typically in under a minute.
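On the API side, a custom tool is just a JSON schema the model can choose to call. Here is a minimal sketch using Chat Completions function calling; the `get_utility_data` function and its schema are hypothetical, invented for this example:

```python
from openai import OpenAI

client = OpenAI()

# Declare a hypothetical tool the model may decide to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_utility_data",  # hypothetical helper, not a built-in
            "description": "Fetch monthly energy usage figures for a US state.",
            "parameters": {
                "type": "object",
                "properties": {
                    "state": {"type": "string", "description": "Two-letter state code"},
                    "year": {"type": "integer"},
                },
                "required": ["state", "year"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "How did California's summer energy usage trend last year?"}],
    tools=tools,  # the model reasons about whether and when to call this
)
print(response.choices[0].message.tool_calls)
```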
For example, when asked "How will summer energy usage in California compare to last year?" the model can:
Search the web for public utility data
Write Python code to build a forecast (a sketch of this step follows the list)
Generate a graph or image visualization
Explain the key factors behind the prediction
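The code-writing step might produce something like this minimal pandas sketch; the CSV path and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical public utility data: one row per month with a usage column.
df = pd.read_csv("ca_energy_usage.csv", parse_dates=["month"])  # placeholder file
summer = df[df["month"].dt.month.isin([6, 7, 8])]

# Fit a simple linear trend over past summers and extrapolate one year ahead.
yearly = summer.groupby(summer["month"].dt.year)["usage_gwh"].sum()
slope, intercept = np.polyfit(yearly.index, yearly.values, 1)
next_year = yearly.index.max() + 1
forecast = slope * next_year + intercept

print(f"Forecast summer {next_year}: {forecast:,.0f} GWh "
      f"({forecast / yearly.iloc[-1] - 1:+.1%} vs last year)")
```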
This flexible, strategic approach allows the models to tackle multi-faceted questions more effectively by:
Accessing up-to-date information beyond the model's built-in knowledge
Performing extended reasoning and analysis
Synthesizing information from multiple sources
Generating outputs across different modalities
The models can also adapt their approach as they work: searching the web repeatedly, evaluating the results, and issuing new searches when more information is needed.
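Client-side, that adaptive behavior is a loop: call the model, execute any tool calls it makes, feed the results back, and repeat until it produces a final answer. A minimal sketch against the Chat Completions API; the `tool_impls` mapping of tool names to Python functions is an assumption of this example:

```python
import json

def run_with_tools(client, messages, tools, tool_impls, model="o4-mini", max_rounds=8):
    """Let the model call tools repeatedly until it returns a final answer."""
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no more tool calls: this is the answer
        messages.append(msg)  # keep the assistant turn in the transcript
        for call in msg.tool_calls:
            fn = tool_impls[call.function.name]
            result = fn(**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return None  # gave up after max_rounds
```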
The boring bits
Cost-efficient reasoning
Beyond their enhanced capabilities, o3 and o4-mini are often more efficient than their predecessors. On benchmarks like the 2025 AIME math competition, the cost-performance frontier for o3 strictly improves over o1, and similarly, o4-mini's frontier strictly improves over o3-mini.
For most real-world usage scenarios, OpenAI expects that o3 and o4-mini will be both smarter and cheaper than o1 and o3-mini, respectively, delivering better value to users.
Comprehensive protections
OpenAI rebuilt its safety training data and stress-tested these models against its latest risk framework, adding new refusal prompts in areas such as:
Biological threats (biorisk)
Malware generation
Jailbreak attempts
This refreshed data has led to strong performance on internal refusal benchmarks. Additionally, OpenAI has developed system-level mitigations to flag dangerous prompts in frontier risk areas, including a reasoning LLM monitor that works from human-written safety specifications.
Both models underwent OpenAI's most rigorous safety testing program to date, evaluated across the three tracked capability areas in its updated Preparedness Framework: biological and chemical, cybersecurity, and AI self-improvement. Both o3 and o4-mini remain below the Framework's "High" threshold in all three categories.
Frontier reasoning in the terminal with Codex CLI
Alongside the model releases, OpenAI has launched Codex CLI, a lightweight coding agent that runs directly from the terminal. This tool is designed to maximize the reasoning capabilities of models like o3 and o4-mini, with upcoming support for additional API models like GPT-4.1.
Codex CLI enables:
Multimodal reasoning from the command line
Processing screenshots or sketches
Direct access to local code
A minimal interface connecting models to users and their computers
The tool is fully open source and available today at github.com/openai/codex, and looks set to compete with Anthropic's Claude Code. OpenAI says more agentic coding tools are coming in the same vein. Look out, Cursor and Windsurf!
Access and availability
Access to the new models is being rolled out across different tiers of OpenAI's products:
o3 and o4-mini replace earlier models across all paid ChatGPT tiers starting today
Enterprise and Edu get access next week
Free users can try o4-mini via the "Think" toggle
Both o3 and o4-mini are also available to developers today via the Chat Completions API and Responses API; some developers will need to verify their organizations to access them. The Responses API supports the following (a minimal request sketch follows the list):
Reasoning summaries
Preserving reasoning tokens around function calls for better performance
Soon-to-be-added built-in tools like web search, file search, and code interpreter
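For instance, a Responses API call that requests a reasoning summary might look like this; the `summary` setting reflects the current API docs as I understand them, so treat it as an assumption:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",
    input="Compare o3 and o4-mini for a high-volume coding assistant.",
    reasoning={"effort": "medium", "summary": "auto"},  # summary field: assumed per current docs
)
print(response.output_text)
```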
OpenAI expects to release o3-pro in a few weeks with full tool support, while Pro users can continue to access o1-pro in the interim.
The future direction
Today's releases reflect the direction OpenAI's models are heading: converging the specialized reasoning capabilities of the o-series with more of the natural conversational abilities and tool use of the GPT-series. By unifying these strengths, future models can support seamless, natural conversations alongside proactive tool use and advanced problem-solving.
This release represents yet another step toward more agentic AI systems that can independently execute complex tasks while maintaining a natural, conversational interface. As AI continues to evolve, finding the right balance between specialized reasoning and broad applicability remains crucial, with o3 and o4-mini demonstrating substantial progress on this frontier.