Key Specifications
Parameters
-
Context
200.0K
Release Date
January 30, 2025
Average Score
56.9%
Timeline
Key dates in the model's history
Announcement
January 30, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
September 30, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$1.10
Output (per 1M tokens)
$4.40
Max Input Tokens
200.0K
Max Output Tokens
100.0K
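As a sanity check on the pricing above, the per-request cost arithmetic can be sketched as follows. The token counts in the example are hypothetical.

```python
# Cost arithmetic for the listed pricing:
# $1.10 per 1M input tokens, $4.40 per 1M output tokens.
INPUT_PRICE_PER_M = 1.10
OUTPUT_PRICE_PER_M = 4.40

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical example: 50k input tokens, 10k output tokens.
cost = request_cost(50_000, 10_000)
print(f"${cost:.3f}")  # → $0.099
```

Note that output tokens are four times as expensive as input tokens, so long generations dominate the bill even for large prompts.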
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
o3-mini high AI: "I will solve mathematics problems from AIME. I will first thoroughly analyze the task, break it into subtasks, and solve it step by step, using all necessary mathematical tools. The goal is to solve the task correctly and obtain the correct answer. I will approach each task as follows: 1. Read the task, noting all important details and what is required to find. 2. Outline a general solution strategy and the key concepts that can be applied. 3. Work through the solution, breaking it into steps with full justification for each step. 4. Check that the solution satisfies all the task's requirements. 5. State the final answer in the required format (usually a number from 0 to 999). I will check for computational errors and review my work, and I will also consider alternative approaches if one approach proves too complex." • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Method (verified predictions): in this evaluation, verification is used to determine model accuracy. The model is given a question with context, and its answer is compared with a pre-prepared reference answer. If the model's answer cannot be judged from the reference alone, it can be checked by other means, for example by a stronger model or by human verification. Verification is useful for evaluating the actual accuracy of a model, especially on tasks such as those posed to frontier models, whose outputs can concern things that are hard to check. For example, some LLMs can reason about mathematical results complex enough that even experts find them difficult to verify. Verification remains a challenge when a model produces new scientific claims that are hard to check. In such cases it is important to rely on evaluation methods that can match outputs against established knowledge, even if only partially. • Self-reported
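The verified-prediction setup described above can be reduced to a toy sketch: each model answer is compared against a pre-prepared reference, and accuracy is the fraction of verified matches. All names and data below are illustrative; this is not the actual SWE-Bench harness.

```python
# Toy sketch of reference-based verification (illustrative only,
# not the real SWE-Bench harness): each model answer is checked
# against a known reference answer by exact match.
def verified_accuracy(predictions: dict[str, str],
                      references: dict[str, str]) -> float:
    """Fraction of tasks whose prediction matches the reference."""
    verified = sum(
        1 for task_id, answer in predictions.items()
        if references.get(task_id) == answer
    )
    return verified / len(predictions)

# Illustrative data only.
preds = {"task-1": "patch-a", "task-2": "patch-b", "task-3": "patch-c"}
refs  = {"task-1": "patch-a", "task-2": "patch-x", "task-3": "patch-c"}
print(verified_accuracy(preds, refs))  # → 0.6666666666666666
```

Real harnesses replace the exact-match check with a stronger verifier (test suites, a grader model, or human review), but the accounting is the same.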
Mathematics
Mathematical problems and computations
MATH
o3-mini high AI (1/10/24): solved several mathematical tasks ranging from high-school to early-undergraduate level. What the model does well: it considers the question from several points of view, outlines solutions, and justifies them. It copes less well with some more complex tasks that require deep understanding. Strengths: solid standard methods; good at equation tasks; can perform probability calculations. Limitations: errors on complex problems, especially multi-step ones; occasional computational errors; does not always understand which methods to use when solving tasks. Overall, the model solves mathematical tasks at HS/early-undergraduate level. It handles standard tasks well, but struggles with those that require deeper understanding or reasoning. • Self-reported
MGSM
Model: o3-mini, temperature 0.7. Description: o3-mini run with a high temperature (0.7), which affects the sampling behavior of the model. A high temperature lets the model consider more possible answers, which can be useful for creative tasks or for generating diverse outputs. However, it can also reduce the consistency and accuracy of answers compared with a lower temperature. • Self-reported
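To make the temperature remark concrete, here is a minimal sketch of how sampling temperature reshapes a token distribution. The logits are made up for illustration; real decoders apply this scaling before sampling the next token.

```python
import math

def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical token logits
for t in (0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# A lower temperature sharpens the distribution (more deterministic);
# a higher temperature flattens it (more diverse, less consistent).
```

This is the trade-off the description refers to: at 0.7 the model samples from a somewhat flattened distribution relative to greedy decoding, trading some accuracy for diversity.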
Reasoning
Logical reasoning and analysis
GPQA
DIAMOND (DIsentangled AMortized ONline Detective): described as a method for detecting problems when working with large-scale computations. Unlike many modern approaches, DIAMOND is claimed to be especially efficient under heavy conditions and to process very large computations without loss of performance. Key points: 1. Training: DIAMOND uses amortized training, allowing it to identify problems quickly. 2. Analysis: it processes data online and adapts in real time. 3. Disentanglement: it separates contributing factors, allowing problems to be pinpointed exactly. 4. Scale: the method reportedly works with large systems while maintaining high accuracy. Reported results claim DIAMOND outperforms existing methods by 17-23% in F1 and runs 30-100 times faster when analyzing systems. It was reportedly tested on various machine-learning setups and showed high efficiency in real usage scenarios. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Benchmark evaluation • Self-reported
Aider-Polyglot Edit
Benchmark evaluation • Self-reported
AIME 2024
evaluation on test set • Self-reported
COLLIE
Benchmark evaluation • Self-reported
ComplexFuncBench
Benchmark evaluation • Self-reported
FrontierMath
pass@1 • Self-reported
Graphwalks BFS <128k
Benchmark result • Self-reported
Graphwalks parents <128k
Benchmark evaluation • Self-reported
IFEval
Benchmark evaluation • Self-reported
Internal API instruction following (hard)
Efficiency score • Self-reported
LiveBench
o3-mini high: a GPT-type model that answers questions about the world. It works well with factual information without relying on external tools. Advantages: fast, direct answers to knowledge queries, with no extra system required. Limitations: no tools, and limited capability for complex tasks where computation is needed. Best suited to answering questions about the world. Useful for: • obtaining facts and data • general-knowledge queries • Self-reported
MultiChallenge
Efficiency score • Self-reported
MultiChallenge (o3-mini grader)
Efficiency score in tests • Self-reported
Multi-IF
Benchmark evaluation • Self-reported
Multilingual MMLU
Benchmark evaluation • Self-reported
OpenAI-MRCR: 2 needle 128k
Benchmark evaluation • Self-reported
SimpleQA
accuracy • Self-reported
SWE-Lancer
percentage score • Self-reported
SWE-Lancer (IC-Diamond subset)
percentage score • Self-reported
TAU-bench Airline
Benchmark evaluation • Self-reported
TAU-bench Retail
Benchmark evaluation • Self-reported
License & Metadata
License
proprietary
Announcement Date
January 30, 2025
Last Updated
July 19, 2025
Similar Models
GPT-3.5 Turbo
OpenAI
Best score:0.7 (MMLU)
Released:Mar 2023
Price:$0.50/1M tokens
GPT-5 Codex
OpenAI
Released:Sep 2025
Price:$2.00/1M tokens
o1-preview
OpenAI
Best score:0.9 (MMLU)
Released:Sep 2024
Price:$15.00/1M tokens
GPT-4 Turbo
OpenAI
Best score:0.9 (HumanEval)
Released:Apr 2024
Price:$10.00/1M tokens
o1-mini
OpenAI
Best score:0.9 (HumanEval)
Released:Sep 2024
Price:$3.00/1M tokens
o1
OpenAI
Best score:0.9 (MMLU)
Released:Dec 2024
Price:$15.00/1M tokens
GPT-4.1 mini
OpenAI
Best score:0.9 (MMLU)
Released:Apr 2025
Price:$0.40/1M tokens
Claude 3.5 Haiku
Anthropic
Best score:0.9 (HumanEval)
Released:Oct 2024
Price:$0.80/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.