OpenAI Releases o3 and o4-mini with Multimodal Reasoning
Summary
OpenAI released o3 and o4-mini on April 16, 2025, the first models in the o-series to natively incorporate images into their chain-of-thought reasoning process. Both models were trained via reinforcement learning to autonomously invoke external tools — including web search, Python execution, and image generation — as part of their reasoning workflows.
What Happened
Unlike earlier o-series models, which reasoned exclusively over text, o3 and o4-mini can directly interpret and reason about images at every step of their internal chain of thought. Tool invocation was not prompt-engineered; it emerged from RL training, leaving the models to decide on their own when to call external capabilities. Independent evaluators contracted by OpenAI found that o3 made approximately 20% fewer major errors than o1 on a comparable task set. The accompanying system card acknowledged a notable failure mode: at high reasoning-effort settings, o3 produced an elevated rate of hallucinated tool calls, confidently invoking tools in ways inconsistent with their actual APIs or capabilities.
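To make the failure mode concrete, a hallucinated tool call is one whose name or arguments don't match any tool the model was actually given. A minimal mitigation on the application side is to validate each model-emitted call against a registry of declared tool signatures before executing it. The sketch below is illustrative only; the tool names and argument schemas are hypothetical, not OpenAI's actual tool API.

```python
# Hypothetical guard against hallucinated tool calls: check a model-emitted
# call against declared tool signatures before executing anything.
# Tool names and schemas here are invented for illustration.

TOOLS = {
    "web_search": {"required": {"query"}, "optional": {"max_results"}},
    "python": {"required": {"code"}, "optional": set()},
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    spec = TOOLS.get(name)
    if spec is None:
        # The model invoked a tool that was never declared to it.
        return [f"unknown tool: {name!r}"]
    problems = []
    missing = spec["required"] - args.keys()
    unexpected = args.keys() - spec["required"] - spec["optional"]
    if missing:
        problems.append(f"missing required args: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected args: {sorted(unexpected)}")
    return problems
```

In an agent loop, a non-empty result would be fed back to the model as an error message rather than executed, turning a confident-but-invalid invocation into a recoverable step.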
Why It Matters
The o3/o4-mini release advanced the case that test-time compute scaling — spending more inference compute on deliberate reasoning — delivers meaningful capability improvements beyond what training-time scaling alone achieves. The models' agentic tool use represented a practical convergence of reasoning and action that many researchers had treated as a longer-horizon milestone. At the same time, the system card's disclosure of increased hallucination under high reasoning effort provided ammunition to those who argue that scaling reasoning depth without corresponding reliability improvements creates new failure modes rather than eliminating existing ones.