
GPT-4.5

Multimodal
OpenAI

GPT-4.5 is OpenAI's most advanced model, offering improved reasoning, coding, and creative capabilities with faster performance and extended context processing compared to GPT-4. The model features enhanced instruction following, reduced hallucinations, and improved factual accuracy.

Key Specifications

Parameters
-
Context
128.0K
Release Date
February 27, 2025
Average Score
63.1%

Timeline

Key dates in the model's history
Announcement
February 27, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal · ZeroEval

Pricing & Availability

Input (per 1M tokens)
$75.00
Output (per 1M tokens)
$150.00
Max Input Tokens
128.0K
Max Output Tokens
4.1K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
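Because pricing is linear in token counts, the cost of a request follows directly from the rates listed above. A minimal sketch (the `request_cost` helper and the example token counts are illustrative, not part of any official SDK):

```python
# Published GPT-4.5 list prices (USD per 1M tokens), from the table above.
INPUT_RATE = 75.00 / 1_000_000
OUTPUT_RATE = 150.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request at list prices."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a request with 10,000 input tokens and 1,000 output tokens:
print(round(request_cost(10_000, 1_000), 2))  # → 0.9
```

At these rates, output tokens cost twice as much as input tokens, so long completions dominate the bill for generation-heavy workloads.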

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy on multiple-choice questions. The model picks one of the listed options and its choice is compared against the answer key, which makes the metric cheap to compute and convenient for comparing models across knowledge domains, as in MMLU. Limitation: it gives only a rough picture of understanding, since a model can select the correct option by elimination or surface cues without actually solving the task. Self-reported
90.8%
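Multiple-choice scoring of this kind reduces to exact-match comparison of the chosen option letter against the answer key. A minimal sketch (the function name and example data are hypothetical):

```python
def choice_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the chosen option letter matches the key."""
    assert len(predictions) == len(gold)
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

# Case-insensitive match; the last prediction is wrong, so 3 of 4 score:
print(choice_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"]))  # → 0.75
```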

Programming

Programming skills tests
HumanEval
Pass@1 measures the proportion of tasks an AI model solves correctly on the first attempt, without errors. A higher Pass@1 indicates better performance: the model produces a correct answer without needing multiple tries. The metric is widely used for programming, mathematics, and other domains where exactness matters. For programming tasks, Pass@1 is the percentage of problems whose generated code passes all tests on the first attempt, making it a direct measure of the model's accuracy and efficiency. Self-reported
88.0%
SWE-Bench Verified
Success rate (resolved issues) · Self-reported
38.0%
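The Pass@1 metric described under HumanEval is conventionally computed with the unbiased pass@k estimator: generate n samples per task, count the c that pass the tests, and estimate the probability that a random draw of k samples contains at least one correct solution. A sketch of that standard estimator (assumed here for illustration; the page does not state how these scores were produced):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per task, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer incorrect samples than k: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct, pass@1 is simply the fraction correct:
print(round(pass_at_k(10, 3, 1), 4))  # → 0.3
```

For k = 1 the estimator reduces to c / n, the fraction of first attempts that pass, which matches the plain-language definition above.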

Mathematics

Mathematical problems and computations
GSM8k
Accuracy · Self-reported
97.0%

Reasoning

Logical reasoning and analysis
GPQA
Accuracy (Diamond subset) · Self-reported
69.5%

Multimodal

Working with images and visual data
MathVista
Accuracy · Self-reported
72.3%
MMMU
Accuracy · Self-reported
75.2%

Other Tests

Specialized benchmarks
Aider-Polyglot Edit
Accuracy · Self-reported
44.9%
AIME 2024
Accuracy · Self-reported
36.7%
CharXiv-D
Accuracy · Self-reported
90.0%
CharXiv-R
Accuracy · Self-reported
55.4%
COLLIE
Accuracy · Self-reported
72.3%
ComplexFuncBench
Accuracy · Self-reported
63.0%
Graphwalks BFS <128k
Accuracy · Self-reported
72.3%
Graphwalks parents <128k
Accuracy · Self-reported
72.6%
IFEval
Accuracy · Self-reported
88.2%
Internal API instruction following (hard)
Accuracy · Self-reported
54.0%
MMMLU
Accuracy · Self-reported
85.1%
MultiChallenge
Accuracy · Self-reported
43.8%
MultiChallenge (o3-mini grader)
Accuracy · Self-reported
50.1%
Multi-IF
Accuracy · Self-reported
70.8%
OpenAI-MRCR: 2 needle 128k
Accuracy · Self-reported
38.5%
SimpleQA
Accuracy · Self-reported
62.5%
SWE-Lancer
Success rate ($186K task set) · Self-reported
37.3%
SWE-Lancer (IC-Diamond subset)
Success rate ($41K task set) · Self-reported
17.4%
TAU-bench Airline
Accuracy · Self-reported
50.0%
TAU-bench Retail
Accuracy · Self-reported
68.4%

License & Metadata

License
proprietary
Announcement Date
February 27, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.