GPT-4.5
Multimodal
GPT-4.5 is OpenAI's most advanced model, offering improved reasoning, coding, and creative capabilities with faster performance and extended context processing compared to GPT-4. The model features enhanced instruction following, reduced hallucinations, and improved factual accuracy.
Key Specifications
Parameters
-
Context
128.0K
Release Date
February 27, 2025
Average Score
63.1%
Timeline
Key dates in the model's history
Announcement
February 27, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$75.00
Output (per 1M tokens)
$150.00
Max Input Tokens
128.0K
Max Output Tokens
4.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
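To make the rates above concrete, here is a minimal cost sketch; the per-token rates are copied from the pricing table, while the example token counts are arbitrary:

```python
# GPT-4.5 rates as listed above: $75.00 per 1M input tokens,
# $150.00 per 1M output tokens.
INPUT_RATE = 75.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 150.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one API call at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 10K-token prompt with a 1K-token completion costs
# 10_000 * 0.000075 + 1_000 * 0.00015 = $0.75 + $0.15 = $0.90.
print(f"${request_cost(10_000, 1_000):.2f}")  # -> $0.90
```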
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Accuracy on multiple-choice tests. When a model is given a task with multiple answer options, it simply selects one option, and that choice is compared with the correct answer. This style of evaluation is typically used on benchmarks such as MMLU. It is especially useful when comparing several models against each other across fields of knowledge. Limitation: it gives only a rough picture of the model's understanding, since a model can pick the correct answer without genuinely solving the task. • Self-reported
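As a sketch of this scoring scheme, multiple-choice accuracy is just the fraction of items where the model's chosen letter matches the answer key; the toy items below are invented for illustration and are not MMLU's actual data format:

```python
# Hypothetical multiple-choice items: (question, options, correct_letter).
items = [
    ("2+2?", ["3", "4", "5", "6"], "B"),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
]

def accuracy(predictions: list[str]) -> float:
    """Fraction of items where the predicted letter matches the key."""
    correct = sum(
        pred == gold for pred, (_, _, gold) in zip(predictions, items)
    )
    return correct / len(items)

print(accuracy(["B", "A"]))  # 1.0 -- both answers match the key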
Programming
Programming skills tests
HumanEval
Pass@1 is an evaluation metric that measures the proportion of tasks an AI model solves correctly on the first attempt, without errors. A higher Pass@1 value indicates better model performance. The metric is important for understanding a model's ability to generate correct answers without needing several attempts, and it is often used to evaluate models on programming, mathematics, and other domains where precision is required. In programming tasks, Pass@1 is the percentage of problems for which the generated code passes all tests on the first attempt, a direct measure of the model's efficiency and accuracy. • Self-reported
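For reference, the standard unbiased pass@k estimator introduced with HumanEval reduces to exactly this first-attempt success fraction when k = 1; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per task, c of them correct.

    Equals 1 - C(n-c, k) / C(n, k); for k = 1 this is simply c / n,
    the probability that a single sampled solution passes all tests.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples for one task, 124 passing: pass@1 = 124/200 = 0.62
print(pass_at_k(200, 124, 1))  # 0.62
```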
SWE-Bench Verified
Success rate • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Accuracy (Diamond) • Self-reported
Multimodal
Working with images and visual data
MathVista
Accuracy • Self-reported
MMMU
Accuracy • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot Edit
Accuracy • Self-reported
AIME 2024
Accuracy • Self-reported
CharXiv-D
Accuracy • Self-reported
CharXiv-R
Accuracy • Self-reported
COLLIE
Accuracy • Self-reported
ComplexFuncBench
Accuracy • Self-reported
Graphwalks BFS <128k
Accuracy • Self-reported
Graphwalks parents <128k
Accuracy • Self-reported
IFEval
Accuracy: evaluated as the share of questions answered correctly. The model must provide an answer for each question. Where a task has a reference answer, accuracy is determined by matching the model's answer against that reference. For constrained-choice questions (for example, yes/no or multiple choice), the model must commit clearly to a specific option; partially correct or ambiguous answers are not counted. • Self-reported
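A minimal sketch of the scoring rule described above, assuming simple normalized exact-match against a reference answer; the normalization shown is an illustrative choice, not the benchmark's exact implementation:

```python
def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation for matching."""
    return answer.strip().strip(".").lower()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of questions where the normalized answers match exactly.

    Partially correct or ambiguous answers count as wrong: either the
    model's committed answer matches the reference, or it scores zero.
    """
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

print(exact_match_accuracy(["Yes.", "Paris"], ["yes", "Berlin"]))  # 0.5
```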
Internal API instruction following (hard)
Accuracy • Self-reported
MMMLU
Accuracy • Self-reported
MultiChallenge
Accuracy • Self-reported
MultiChallenge (o3-mini grader)
Accuracy • Self-reported
Multi-IF
Accuracy • Self-reported
OpenAI-MRCR: 2 needle 128k
Accuracy • Self-reported
SimpleQA
Accuracy • Self-reported
SWE-Lancer
Earnings ($186K) • Self-reported
SWE-Lancer (IC-Diamond subset)
Success rate ($41K earned) • Self-reported
TAU-bench Airline
Accuracy • Self-reported
TAU-bench Retail
Accuracy • Self-reported
License & Metadata
License
proprietary
Announcement Date
February 27, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score:0.8 (GPQA)
Released:Apr 2025
Price:$1.10/1M tokens
GPT-4o
OpenAI
MM
Best score:0.9 (MMLU)
Released:Aug 2024
Price:$2.50/1M tokens
GPT-4o mini
OpenAI
MM
Best score:0.9 (HumanEval)
Released:Jul 2024
Price:$0.15/1M tokens
o3
OpenAI
MM
Best score:0.8 (GPQA)
Released:Apr 2025
Price:$2.00/1M tokens
GPT-4.1
OpenAI
MM
Best score:0.9 (MMLU)
Released:Apr 2025
Price:$2.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score:0.7 (GPQA)
Released:Aug 2025
Price:$0.05/1M tokens
o1-pro
OpenAI
MM
Best score:0.8 (GPQA)
Released:Dec 2024
GPT-4
OpenAI
MM
Best score:1.0 (ARC)
Released:Jun 2023
Price:$30.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.