Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
49.6%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$1.60
Max Input Tokens
1.0M
Max Output Tokens
32.8K
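Cost at the listed rates scales linearly with token counts. The short Python sketch below illustrates the arithmetic; the request sizes in the example are hypothetical, not measured usage.

# Cost estimate at the listed rates: $0.40 per 1M input tokens, $1.60 per 1M output tokens.
INPUT_RATE_PER_M = 0.40
OUTPUT_RATE_PER_M = 1.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Return the USD cost of a single request at the listed per-1M-token rates.
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

print(f"${request_cost(10_000, 2_000):.4f}")  # 10k input + 2k output tokens -> $0.0072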
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
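For orientation, the sketch below shows how the function-calling feature is typically exercised through the OpenAI Python SDK (openai>=1.0). It is a minimal illustration under stated assumptions, not the vendor's documented example: the model identifier "gpt-4.1-mini" is inferred from the benchmark labels on this page, and the get_weather tool is hypothetical.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition; the model may choose to call it with JSON arguments.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # assumed API identifier for this model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model elected to call the tool, its name and JSON arguments are returned here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)

The other listed features are exposed through additional request parameters or separate endpoints of the same SDK.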
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Methodology per [2] • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond subset, standard benchmark • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Standard benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs without special tools or prompts ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs without special tools or prompts ([4], GPT-4o) • Self-reported
AIME 2025
GPT-4.1 mini without tools - mathematics (AIME 2025) • Self-reported
Humanity's Last Exam
GPT-4.1 mini without tools - Expert-level questions across various subjects.
HMMT 2025
GPT-4.1 mini without tools - Harvard-MIT Mathematics Tournament. • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-5.3 Codex
OpenAI
MM
Released: Feb 2026
Price: $1.75/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
GPT-5.1 Codex Mini
OpenAI
MM
Released: Nov 2025
Price: $0.25/1M tokens
GPT-4.1 nano
OpenAI
MM
Best score: 0.8 (MMLU)
Released: Apr 2025
Price: $0.10/1M tokens
GPT-5.4 Pro
OpenAI
MM
Released: Mar 2026
Price: $15.00/1M tokens
o3-pro
OpenAI
MM
Released: Jun 2025
Price: $20.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
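As a rough illustration of the similarity idea described above, a naive scoring function over the listed characteristics might look like the sketch below. This is not the catalog's actual recommendation algorithm, and the example values are hypothetical.

# Naive similarity sketch over the criteria named above; illustrative only.
def similarity(a: dict, b: dict) -> float:
    score = 0.0
    score += 1.0 if a["developer"] == b["developer"] else 0.0    # same organization
    score += 1.0 if a["multimodal"] == b["multimodal"] else 0.0  # multimodality match
    score += 1.0 - abs(a["benchmark"] - b["benchmark"])          # closer scores (0..1) rank higher
    # Parameter size omitted here because it is not listed for these models.
    return score

model_a = {"developer": "OpenAI", "multimodal": True, "benchmark": 0.50}  # hypothetical values
model_b = {"developer": "OpenAI", "multimodal": True, "benchmark": 0.75}  # hypothetical values
print(similarity(model_a, model_b))  # 2.75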