Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
56.8%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Today
March 26, 2026
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.00
Output (per 1M tokens)
$8.00
Max Input Tokens
1.0M
Max Output Tokens
32.8K
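As a quick illustration of how the per-token pricing translates into request cost, here is a minimal sketch; the prices are the ones listed above, while the token counts in the example are arbitrary placeholders.

```python
# Minimal sketch: estimating per-request cost from the listed prices.
# Prices are taken from this page; the example token counts are arbitrary.

INPUT_PRICE_PER_M = 2.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 8.00  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 50k-token prompt with a 2k-token completion
print(f"${estimate_cost_usd(50_000, 2_000):.4f}")  # -> $0.1160
```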
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
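These features are exposed through the standard OpenAI Chat Completions API. The sketch below shows a minimal function-calling request; it assumes the model is served under the "gpt-4.1" identifier, and the get_weather tool is a hypothetical example rather than part of the API.

```python
# Minimal sketch of the Function Calling feature via the OpenAI Python SDK.
# Assumes the model identifier is "gpt-4.1"; get_weather is a hypothetical tool.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decided to call the tool, the call and its JSON arguments
# are returned here instead of a plain-text answer.
print(response.choices[0].message.tool_calls)
```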
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Standard benchmark (methodology: [2]) • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Internal benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs, without custom tools or prompting ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs, without custom tools or prompting ([4], GPT-4o user model) • Self-reported
Video-MME (long, no subtitles)
Standard benchmark • Self-reported
AIME 2025
GPT-4.1 without tools - mathematics (AIME 2025) • Self-reported
Humanity's Last Exam
GPT-4.1 without tools - expert-level questions across a wide range of subjects. • Self-reported
HMMT 2025
GPT-4.1 without tools - Harvard-MIT Mathematics Tournament. • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-4o
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Aug 2024
Price: $2.50/1M tokens
GPT-4o mini
OpenAI
MM
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
o3
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4.5
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Feb 2025
Price: $75.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
GPT-4
OpenAI
MM
Best score: 1.0 (ARC)
Released: Jun 2023
Price: $30.00/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
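The page does not publish its similarity formula, so the sketch below is only an illustration of how those four characteristics could be combined into a single score; the weights and field names are assumptions, not the site's actual method.

```python
# Toy similarity score over the four listed characteristics.
# Weights and scoring are illustrative assumptions only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    developer: str
    multimodal: bool
    params_b: Optional[float]  # parameter count in billions; None if undisclosed
    avg_score: float           # average benchmark score in [0, 1]

def similarity(a: Model, b: Model) -> float:
    score = 0.0
    score += 0.35 if a.developer == b.developer else 0.0
    score += 0.25 if a.multimodal == b.multimodal else 0.0
    if a.params_b and b.params_b:  # compare sizes only when both are disclosed
        score += 0.15 * min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score += 0.25 * (1.0 - abs(a.avg_score - b.avg_score))
    return score

this_model = Model("OpenAI", True, None, 0.568)
candidate = Model("OpenAI", True, None, 0.80)
print(round(similarity(this_model, candidate), 3))  # -> 0.792
```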