GPT-4o
GPT-4o ('o' stands for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on text and code, with improved understanding of non-English languages, images, and audio.
Key Specifications
Parameters
-
Context
128.0K
Release Date
August 6, 2024
Average Score
52.8%
Timeline
Key dates in the model's history
Announcement
August 6, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.50
Output (per 1M tokens)
$10.00
Max Input Tokens
128.0K
Max Output Tokens
16.4K
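At the listed rates, the cost of a single call is simple arithmetic. A minimal sketch (token counts are illustrative; real counts come back in the API response):

```python
# Listed GPT-4o rates: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the listed per-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A maximal-output call: 100k input tokens plus 16k output tokens.
print(round(request_cost(100_000, 16_000), 4))  # 0.41
```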
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
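Of these features, function calling works by describing tools as JSON schemas that the model can choose to invoke. A minimal sketch of a tool definition in the shape the Chat Completions `tools` parameter expects (the `get_weather` function itself is hypothetical):

```python
import json

# Hypothetical `get_weather` tool, described as a JSON-schema object in the
# shape the Chat Completions API expects for its `tools` parameter.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The definition is plain JSON-serializable data.
print(json.dumps(weather_tool, indent=2))
```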
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Accuracy: the share of questions the model answers correctly. Accuracy can be measured on knowledge benchmarks such as TruthfulQA, and in specific domains, for example executing mathematical tasks or reproducing algorithms without logical errors. It is also important to evaluate accuracy directly on tasks requiring knowledge or reasoning, not only on questions about the models themselves. Roughly: low accuracy means the model has significant gaps in knowledge; medium accuracy means the model is usually correct but makes errors in complex cases; high accuracy means the model gives exact answers even in complex or ambiguous situations. • Self-reported
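As a concrete illustration, accuracy on a multiple-choice benchmark like MMLU is usually plain exact match over predicted answer letters. A minimal sketch with made-up answers:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(references) and references
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical answer letters for five MMLU-style questions.
preds = ["A", "C", "B", "D", "A"]
refs  = ["A", "C", "D", "D", "A"]
print(exact_match_accuracy(preds, refs))  # 0.8
```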
Programming
Programming skills tests
SWE-Bench Verified
Accuracy: the percentage of verified real-world GitHub issues the model resolves end to end.
Accuracy is perhaps the most obvious and well-known metric for measuring LLM performance. For a question-answering task or a conversation, we might measure what percentage of responses are factually correct, or what percentage of factual claims in a response are true.
Accuracy is the bedrock of trust in AI systems and the main metric in most established LLM benchmarks, such as MMLU, GPQA, BIG-Bench Hard, and GSM8K.
When we talk about accuracy, it is worth examining both absolute accuracy (correctness relative to the ground truth) and comparative accuracy (how the model's knowledge compares to other models).
An advanced model does not need to be correct 100% of the time, but when it is wrong, we expect it to be wrong in the right ways: perhaps due to inherent ambiguity in the question, the subjective nature of the domain, or a lack of perfect ground truth. We particularly want to avoid confident but incorrect responses (hallucinations). • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
GPT-4o, GPQA Diamond, without thinking mode and without tools. GPT-4o reaches this level on GPQA Diamond without using a thinking mode or external tools, indicating noticeably stronger understanding and reasoning than earlier models such as GPT-3.5, which need tools or additional prompting approaches to achieve similar results. • Self-reported
Multimodal
Working with images and visual data
AI2D
Evaluation on test set • Self-reported
ChartQA
Evaluation on test set • Self-reported
DocVQA
Evaluation on test set • Self-reported
MathVista
Accuracy
MathVista evaluates mathematical reasoning grounded in visual contexts such as charts, plots, and geometry diagrams; the score is the fraction of questions answered correctly. • Self-reported
MMMU
GPT-4o without thinking mode: solving expert-level visual tasks. • Self-reported
Other Tests
Specialized benchmarks
ActivityNet
Evaluation on test set • Self-reported
Aider-Polyglot
Accuracy • Self-reported
Aider-Polyglot Edit
Accuracy • Self-reported
AIME 2024
Accuracy • Self-reported
CharXiv-D
Accuracy: descriptive questions about scientific charts. • Self-reported
CharXiv-R
GPT-4o without thinking mode: reasoning questions about charts that require justification. • Self-reported
COLLIE
GPT-4o without thinking mode: following constraints on generated text. • Self-reported
Tau2 airline
GPT-4o without thinking mode: agentic benchmark with function calls (airline domain). • Self-reported
Tau2 retail
GPT-4o without thinking mode: agentic benchmark with function calls (retail domain). • Self-reported
Tau2 telecom
GPT-4o without thinking mode: agentic benchmark with function calls (telecom domain). • Self-reported
MMMU-Pro
GPT-4o without thinking mode: solving expert-level visual tasks with reasoning. • Self-reported
VideoMMMU
GPT-4o without thinking mode: multimodal reasoning over video (256 frames). • Self-reported
ERQA
GPT-4o without thinking mode. • Self-reported
ComplexFuncBench
Accuracy: the fraction of complex, multi-step function-calling tasks completed with exactly correct calls and final answers. • Self-reported
EgoSchema
Evaluation on test set • Self-reported
Graphwalks BFS <128k
Accuracy
Graphwalks asks the model to traverse a graph given in the context, here with breadth-first search, over contexts up to 128k tokens; accuracy is the fraction of traversals answered correctly. • Self-reported
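The ground truth for a Graphwalks-style BFS task can be computed with an ordinary breadth-first search. A minimal sketch over a made-up graph:

```python
from collections import deque

def bfs_within_hops(graph, start, max_hops):
    """Return the set of nodes reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Made-up adjacency list; "e" is 3 hops away, so it is excluded at max_hops=2.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(sorted(bfs_within_hops(graph, "a", 2)))  # ['a', 'b', 'c', 'd']
```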
Graphwalks parents <128k
Accuracy
Graphwalks asks the model to identify the parent nodes of a given node in a graph provided in the context, over contexts up to 128k tokens. • Self-reported
IFEval
Accuracy • Self-reported
Internal API instruction following (hard)
Accuracy • Self-reported
MMLU-Pro
0-shot CoT. Zero-shot Chain-of-Thought (0-shot CoT) is a prompting technique that asks the LLM to reason step by step through a complex task without providing worked examples of such reasoning. To use it, simply add a step-by-step instruction (for example, "Let's solve this step by step") to the prompt. This lets the model break a complex task into manageable steps instead of trying to produce the final answer immediately, which improves performance on reasoning-heavy tasks, especially mathematical and logical ones. • Self-reported
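A zero-shot CoT prompt is just the question plus a trigger phrase, with no worked examples. A minimal sketch (the exact trigger wording varies between papers and evaluations):

```python
# Assumed trigger phrase; evaluations use variants such as
# "Let's think step by step."
COT_TRIGGER = "Let's solve this step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Append a zero-shot chain-of-thought trigger to a question.
    No worked examples are included: that is what makes it zero-shot."""
    return f"{question}\n{COT_TRIGGER}"

print(zero_shot_cot_prompt(
    "If a train travels 60 km in 45 minutes, what is its average speed?"))
```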
MMMLU
Accuracy. MMMLU is a multilingual version of MMLU with questions professionally translated into multiple languages; the score is exact-match accuracy on the multiple-choice answers. • Self-reported
MultiChallenge (o3-mini grader)
Accuracy, graded by o3-mini. • Self-reported
Multi-IF
Accuracy
Multi-IF evaluates multi-turn, multilingual instruction following; the score is the fraction of instructions satisfied. • Self-reported
OpenAI-MRCR: 2 needle 128k
Accuracy: multi-round co-reference resolution, retrieving 2 "needles" from contexts up to 128k tokens. • Self-reported
SimpleQA
Accuracy • Self-reported
SWE-Lancer
Result • Self-reported
SWE-Lancer (IC-Diamond subset)
Score • Self-reported
TAU-bench Airline
Accuracy. We evaluate the model's answers to benchmark questions that require factual knowledge and logical reasoning by comparing them with reference solutions: how reliably the model selects the correct option, or answers free-form questions. The evaluation considers only whether the final answer is correct, independent of the quality of the explanation, so accuracy reflects the model's knowledge rather than its fluency. • Self-reported
TAU-bench Retail
Accuracy • Self-reported
Humanity's Last Exam
GPT-4o without thinking mode (without tools): a set of expert-level questions across many subjects. • Self-reported
Scale MultiChallenge
GPT-4o without thinking mode: benchmark for instruction following. • Self-reported
License & Metadata
License
proprietary
Announcement Date
August 6, 2024
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-4.1
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4o mini
OpenAI
MM
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
o3
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4.5
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Feb 2025
Price: $75.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
GPT-4
OpenAI
MM
Best score: 1.0 (ARC)
Released: Jun 2023
Price: $30.00/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.