Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
34.2%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
32.8K
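At the listed rates ($0.10 per 1M input tokens, $0.40 per 1M output tokens), per-request cost is simple arithmetic. A minimal sketch in Python; the token counts in the example are illustrative, not taken from this page:

```python
# Rates from the pricing table above, converted to USD per token.
INPUT_RATE = 0.10 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.40 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 50k-token prompt with a 2k-token completion.
print(f"${request_cost(50_000, 2_000):.4f}")  # $0.0058
```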
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
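Of the listed features, function calling is the one most integrations wire up first. A minimal sketch using the OpenAI Python SDK; the model ID and the get_weather tool are assumptions for illustration, since this page does not state the model's API name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1-nano",  # assumed model ID, not confirmed by this page
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```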
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Reasoning
Logical reasoning and analysis
GPQA Diamond
Standard benchmark • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Standard benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
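The four Graphwalks entries above probe long-context graph traversal: the prompt embeds a large edge list and asks the model for, e.g., the set of nodes at a given BFS depth from a start node, or a node's parents. A minimal reference BFS in Python for comparison; the edge-list format here is illustrative, not the benchmark's exact format:

```python
from collections import defaultdict, deque

def bfs_depths(edges: list[tuple[str, str]], root: str) -> dict[str, int]:
    """Breadth-first traversal; returns each reachable node's depth from root."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

# Illustrative edge list; Graphwalks embeds a far larger one in the prompt.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(bfs_depths(edges, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```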
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs without tools/prompts ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs without tools/prompts ([4], model: GPT-4o) • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
GPT-5.1 Codex Mini
OpenAI
MM
Released: Nov 2025
Price: $0.25/1M tokens
GPT-5.1 Medium
OpenAI
MM
Released: Nov 2025
Price: $1.00/1M tokens
GPT-5.1 Codex High
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.3 Codex
OpenAI
MM
Released: Feb 2026
Price: $1.75/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
GPT-4.1 mini
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.40/1M tokens
GPT-5.4 Pro
OpenAI
MM
Released: Mar 2026
Price: $15.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.