GPT-4.5
Multimodal
GPT-4.5 is OpenAI's most advanced model, offering improved reasoning, coding, and creative capabilities with faster performance and extended context processing compared to GPT-4. The model features enhanced instruction following, reduced hallucinations, and improved factual accuracy.
Key Specifications
Parameters
-
Context
128.0K
Release Date
February 27, 2025
Average Score
63.1%
Timeline
Key dates in the model's history
Announcement
February 27, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$75.00
Output (per 1M tokens)
$150.00
Max Input Tokens
128.0K
Max Output Tokens
4.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
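To make the rates above concrete, here is a minimal cost sketch; the per-token rates are copied from the pricing table, while the example token counts are arbitrary:

```python
# GPT-4.5 rates as listed above: $75.00 per 1M input tokens,
# $150.00 per 1M output tokens.
INPUT_RATE = 75.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 150.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one API call at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 10K-token prompt with a 1K-token completion costs
# 10_000 * 0.000075 + 1_000 * 0.00015 = $0.75 + $0.15 = $0.90.
print(f"${request_cost(10_000, 1_000):.2f}")  # -> $0.90
```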
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Accuracy on multiple-choice tests. When a model is given a task with multiple answer options, it simply selects one option, and that choice is compared with the correct answer. This style of evaluation is typically used on benchmarks such as MMLU. It is especially useful when comparing several models against each other across fields of knowledge. Limitation: it gives only a rough picture of the model's understanding, since a model can pick the correct answer without genuinely solving the task. • Self-reported
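As a sketch of this scoring scheme, multiple-choice accuracy is just the fraction of items where the model's chosen letter matches the answer key; the toy items below are invented for illustration and are not MMLU's actual data format:

```python
# Hypothetical multiple-choice items: (question, options, correct_letter).
items = [
    ("2+2?", ["3", "4", "5", "6"], "B"),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
]

def accuracy(predictions: list[str]) -> float:
    """Fraction of items where the predicted letter matches the key."""
    correct = sum(
        pred == gold for pred, (_, _, gold) in zip(predictions, items)
    )
    return correct / len(items)

print(accuracy(["B", "A"]))  # 1.0 -- both answers match the key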
Programming
Programming skills tests
HumanEval
Pass@1 is an evaluation metric that measures the proportion of tasks an AI model solves correctly on the first attempt, without errors. A higher Pass@1 value indicates better model performance. The metric is important for understanding a model's ability to generate correct answers without needing several attempts, and it is often used to evaluate models on programming, mathematics, and other domains where precision is required. In programming tasks, Pass@1 is the percentage of problems for which the generated code passes all tests on the first attempt, a direct measure of the model's efficiency and accuracy. • Self-reported
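For reference, the standard unbiased pass@k estimator introduced with HumanEval reduces to exactly this first-attempt success fraction when k = 1; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per task, c of them correct.

    Equals 1 - C(n-c, k) / C(n, k); for k = 1 this is simply c / n,
    the probability that a single sampled solution passes all tests.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples for one task, 124 passing: pass@1 = 124/200 = 0.62
print(pass_at_k(200, 124, 1))  # 0.62
```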
SWE-Bench Verified
Success rate • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Accuracy (Diamond) • Self-reported
Multimodal
Working with images and visual data
MathVista
Accuracy • Self-reported
MMMU
Accuracy • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot Edit
Accuracy • Self-reported
AIME 2024
Accuracy • Self-reported
CharXiv-D
Accuracy • Self-reported
CharXiv-R
Accuracy • Self-reported
COLLIE
Accuracy • Self-reported
ComplexFuncBench
Accuracy • Self-reported
Graphwalks BFS <128k
Accuracy • Self-reported
Graphwalks parents <128k
Accuracy • Self-reported
IFEval
Accuracy: evaluated as the share of questions answered correctly. The model must provide an answer for each question. Where a task has a reference answer, accuracy is determined by matching the model's answer against that reference. For constrained-choice questions (for example, yes/no or multiple choice), the model must commit clearly to a specific option; partially correct or ambiguous answers are not counted. • Self-reported
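A minimal sketch of the scoring rule described above, assuming simple normalized exact-match against a reference answer; the normalization shown is an illustrative choice, not the benchmark's exact implementation:

```python
def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation for matching."""
    return answer.strip().strip(".").lower()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of questions where the normalized answers match exactly.

    Partially correct or ambiguous answers count as wrong: either the
    model's committed answer matches the reference, or it scores zero.
    """
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

print(exact_match_accuracy(["Yes.", "Paris"], ["yes", "Berlin"]))  # 0.5
```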
Internal API instruction following (hard)
Accuracy • Self-reported
MMMLU
Accuracy • Self-reported
MultiChallenge
Accuracy • Self-reported
MultiChallenge (o3-mini grader)
Accuracy • Self-reported
Multi-IF
Accuracy • Self-reported
OpenAI-MRCR: 2 needle 128k
Accuracy • Self-reported
SimpleQA
Accuracy • Self-reported
SWE-Lancer
Earnings ($186K) • Self-reported
SWE-Lancer (IC-Diamond subset)
Success rate ($41K earned) • Self-reported
TAU-bench Airline
Accuracy • Self-reported
TAU-bench Retail
Accuracy • Self-reported
License & Metadata
License
proprietary
Announcement Date
February 27, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score:0.8 (GPQA)
Released:Apr 2025
Price:$1.10/1M tokens
GPT-4o
OpenAI
MM
Best score:0.9 (MMLU)
Released:Aug 2024
Price:$2.50/1M tokens
GPT-4o mini
OpenAI
MM
Best score:0.9 (HumanEval)
Released:Jul 2024
Price:$0.15/1M tokens
o3
OpenAI
MM
Best score:0.8 (GPQA)
Released:Apr 2025
Price:$2.00/1M tokens
GPT-4.1
OpenAI
MM
Best score:0.9 (MMLU)
Released:Apr 2025
Price:$2.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score:0.7 (GPQA)
Released:Aug 2025
Price:$0.05/1M tokens
o1-pro
OpenAI
MM
Best score:0.8 (GPQA)
Released:Dec 2024
GPT-4
OpenAI
MM
Best score:1.0 (ARC)
Released:Jun 2023
Price:$30.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.