Key Specifications
Parameters
-
Context
200.0K
Release Date
January 30, 2025
Average Score
56.9%
Timeline
Key dates in the model's history
Announcement
January 30, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
September 30, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$1.10
Output (per 1M tokens)
$4.40
Max Input Tokens
200.0K
Max Output Tokens
100.0K
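As a sanity check on the pricing above, the per-request cost arithmetic can be sketched as follows. The token counts in the example are hypothetical.

```python
# Cost arithmetic for the listed pricing:
# $1.10 per 1M input tokens, $4.40 per 1M output tokens.
INPUT_PRICE_PER_M = 1.10
OUTPUT_PRICE_PER_M = 4.40

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical example: 50k input tokens, 10k output tokens.
cost = request_cost(50_000, 10_000)
print(f"${cost:.3f}")  # → $0.099
```

Note that output tokens are four times as expensive as input tokens, so long generations dominate the bill even for large prompts.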
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
o3-mini high AI: "I will solve mathematics problems from AIME. I will first thoroughly analyze the task, break it into subtasks, and solve it step by step, using all necessary mathematical tools. The goal is to solve the task correctly and obtain the correct answer. I will approach each task as follows: 1. Read the task, noting all important details and what is required to find. 2. Outline a general solution strategy and the key concepts that can be applied. 3. Work through the solution, breaking it into steps with full justification for each step. 4. Check that the solution satisfies all the task's requirements. 5. State the final answer in the required format (usually a number from 0 to 999). I will check for computational errors and review my work, and I will also consider alternative approaches if one approach proves too complex." • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Method (verified predictions): in this evaluation, verification is used to determine model accuracy. The model is given a question with context, and its answer is compared with a pre-prepared reference answer. If the model's answer cannot be judged from the reference alone, it can be checked by other means, for example by a stronger model or by human verification. Verification is useful for evaluating the actual accuracy of a model, especially on tasks such as those posed to frontier models, whose outputs can concern things that are hard to check. For example, some LLMs can reason about mathematical results complex enough that even experts find them difficult to verify. Verification remains a challenge when a model produces new scientific claims that are hard to check. In such cases it is important to rely on evaluation methods that can match outputs against established knowledge, even if only partially. • Self-reported
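The verified-prediction setup described above can be reduced to a toy sketch: each model answer is compared against a pre-prepared reference, and accuracy is the fraction of verified matches. All names and data below are illustrative; this is not the actual SWE-Bench harness.

```python
# Toy sketch of reference-based verification (illustrative only,
# not the real SWE-Bench harness): each model answer is checked
# against a known reference answer by exact match.
def verified_accuracy(predictions: dict[str, str],
                      references: dict[str, str]) -> float:
    """Fraction of tasks whose prediction matches the reference."""
    verified = sum(
        1 for task_id, answer in predictions.items()
        if references.get(task_id) == answer
    )
    return verified / len(predictions)

# Illustrative data only.
preds = {"task-1": "patch-a", "task-2": "patch-b", "task-3": "patch-c"}
refs  = {"task-1": "patch-a", "task-2": "patch-x", "task-3": "patch-c"}
print(verified_accuracy(preds, refs))  # → 0.6666666666666666
```

Real harnesses replace the exact-match check with a stronger verifier (test suites, a grader model, or human review), but the accounting is the same.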
Mathematics
Mathematical problems and computations
MATH
o3-mini high AI (1/10/24): solved several mathematical tasks ranging from high-school to early-undergraduate level. What the model does well: it considers the question from several points of view, outlines solutions, and justifies them. It copes less well with some more complex tasks that require deep understanding. Strengths: solid standard methods; good at equation tasks; can perform probability calculations. Limitations: errors on complex problems, especially multi-step ones; occasional computational errors; does not always understand which methods to use when solving tasks. Overall, the model solves mathematical tasks at HS/early-undergraduate level. It handles standard tasks well, but struggles with those that require deeper understanding or reasoning. • Self-reported
MGSM
Model: o3-mini, temperature 0.7. Description: o3-mini run with a high temperature (0.7), which affects the sampling behavior of the model. A high temperature lets the model consider more possible answers, which can be useful for creative tasks or for generating diverse outputs. However, it can also reduce the consistency and accuracy of answers compared with a lower temperature. • Self-reported
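To make the temperature remark concrete, here is a minimal sketch of how sampling temperature reshapes a token distribution. The logits are made up for illustration; real decoders apply this scaling before sampling the next token.

```python
import math

def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical token logits
for t in (0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# A lower temperature sharpens the distribution (more deterministic);
# a higher temperature flattens it (more diverse, less consistent).
```

This is the trade-off the description refers to: at 0.7 the model samples from a somewhat flattened distribution relative to greedy decoding, trading some accuracy for diversity.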
Reasoning
Logical reasoning and analysis
GPQA
DIAMOND (DIsentangled AMortized ONline Detective): described as a method for detecting problems when working with large-scale computations. Unlike many modern approaches, DIAMOND is claimed to be especially efficient under heavy conditions and to process very large computations without loss of performance. Key points: 1. Training: DIAMOND uses amortized training, allowing it to identify problems quickly. 2. Analysis: it processes data online and adapts in real time. 3. Disentanglement: it separates contributing factors, allowing problems to be pinpointed exactly. 4. Scale: the method reportedly works with large systems while maintaining high accuracy. Reported results claim DIAMOND outperforms existing methods by 17-23% in F1 and runs 30-100 times faster when analyzing systems. It was reportedly tested on various machine-learning setups and showed high efficiency in real usage scenarios. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Benchmark evaluation • Self-reported
Aider-Polyglot Edit
Benchmark evaluation • Self-reported
AIME 2024
evaluation on test set • Self-reported
COLLIE
Benchmark evaluation • Self-reported
ComplexFuncBench
Benchmark evaluation • Self-reported
FrontierMath
pass@1 • Self-reported
Graphwalks BFS <128k
Benchmark result • Self-reported
Graphwalks parents <128k
Benchmark evaluation • Self-reported
IFEval
Benchmark evaluation • Self-reported
Internal API instruction following (hard)
Efficiency score • Self-reported
LiveBench
o3-mini high: a GPT-type model that answers questions about the world. It works well with factual information without relying on external tools. Advantages: fast, direct answers to knowledge queries, with no extra system required. Limitations: no tools, and limited capability for complex tasks where computation is needed. Best suited to answering questions about the world. Useful for: • obtaining facts and data • general-knowledge queries • Self-reported
MultiChallenge
Efficiency score • Self-reported
MultiChallenge (o3-mini grader)
Efficiency score in tests • Self-reported
Multi-IF
Benchmark evaluation • Self-reported
Multilingual MMLU
Benchmark evaluation • Self-reported
OpenAI-MRCR: 2 needle 128k
Benchmark evaluation • Self-reported
SimpleQA
accuracy • Self-reported
SWE-Lancer
percentage score • Self-reported
SWE-Lancer (IC-Diamond subset)
percentage score • Self-reported
TAU-bench Airline
Benchmark evaluation • Self-reported
TAU-bench Retail
Benchmark evaluation • Self-reported
License & Metadata
License
proprietary
Announcement Date
January 30, 2025
Last Updated
July 19, 2025
Similar Models
GPT-3.5 Turbo
OpenAI
Best score:0.7 (MMLU)
Released:Mar 2023
Price:$0.50/1M tokens
GPT-5 Codex
OpenAI
Released:Sep 2025
Price:$2.00/1M tokens
o1-preview
OpenAI
Best score:0.9 (MMLU)
Released:Sep 2024
Price:$15.00/1M tokens
GPT-4 Turbo
OpenAI
Best score:0.9 (HumanEval)
Released:Apr 2024
Price:$10.00/1M tokens
o1-mini
OpenAI
Best score:0.9 (HumanEval)
Released:Sep 2024
Price:$3.00/1M tokens
o1
OpenAI
Best score:0.9 (MMLU)
Released:Dec 2024
Price:$15.00/1M tokens
GPT-4.1 mini
OpenAI
Best score:0.9 (MMLU)
Released:Apr 2025
Price:$0.40/1M tokens
Claude 3.5 Haiku
Anthropic
Best score:0.9 (HumanEval)
Released:Oct 2024
Price:$0.80/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.