
GPT-4o

Multimodal
OpenAI

GPT-4o ('o' stands for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on text and code, with improvements in understanding non-English languages, images, and audio.
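As an illustration of the multimodal input described above, a request mixing text and an image can be expressed as a single message whose content is a list of typed parts. A minimal sketch of a Chat Completions request body (the image URL is a placeholder; in practice the dict is sent via an API client):

```python
# Sketch of a multimodal Chat Completions request body for GPT-4o.
# The URL below is a placeholder, not a real image.
payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                # Text part of the prompt
                {"type": "text", "text": "What is shown in this image?"},
                # Image part, passed by URL
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
}
```

Text-only requests use the same structure with a plain string as `content`; the typed-parts list is only needed when mixing modalities.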

Key Specifications

Parameters
-
Context
128.0K
Release Date
August 6, 2024
Average Score
52.8%

Timeline

Key dates in the model's history
Announcement
August 6, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$2.50
Output (per 1M tokens)
$10.00
Max Input Tokens
128.0K
Max Output Tokens
16.4K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
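Using the listed rates ($2.50 per 1M input tokens, $10.00 per 1M output tokens), the cost of a single request can be estimated from its token counts. A minimal sketch with the rates hard-coded from the table above:

```python
INPUT_RATE = 2.50 / 1_000_000    # USD per input token ($2.50 per 1M)
OUTPUT_RATE = 10.00 / 1_000_000  # USD per output token ($10.00 per 1M)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one GPT-4o request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 10,000-token prompt with a 1,000-token reply
# costs 0.025 + 0.010 = $0.035.
```

Note that input is capped at 128.0K tokens and output at 16.4K tokens per request, which bounds the worst-case cost of a single call.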

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy on MMLU: multiple-choice questions spanning 57 subjects across STEM, the humanities, and the social sciences, testing general knowledge and reasoning. Self-reported
85.7%

Programming

Programming skills tests
SWE-Bench Verified
Resolution rate on a human-validated subset of real GitHub issues: the model must produce a code patch that makes the repository's tests pass. Self-reported
33.2%

Reasoning

Logical reasoning and analysis
GPQA
Accuracy on GPQA Diamond, graduate-level multiple-choice science questions in biology, physics, and chemistry, evaluated without extended thinking and without tools. Self-reported
70.1%

Multimodal

Working with images and visual data
AI2D
Evaluation on the test set. Self-reported
94.2%
ChartQA
Evaluation on the test set. Self-reported
85.7%
DocVQA
Evaluation on the test set. Self-reported
92.8%
MathVista
Accuracy on mathematical reasoning problems set in visual contexts such as charts, plots, and diagrams. Self-reported
61.4%
MMMU
Evaluated without extended thinking: college-level visual understanding and reasoning tasks across multiple disciplines. Self-reported
72.2%

Other Tests

Specialized benchmarks
ActivityNet
Evaluation on the test set. Self-reported
61.9%
Aider-Polyglot
Accuracy. Self-reported
30.7%
Aider-Polyglot Edit
Accuracy. Self-reported
18.2%
AIME 2024
Accuracy. Self-reported
13.1%
CharXiv-D
Evaluated without extended thinking: descriptive questions about charts from scientific papers. Self-reported
85.3%
CharXiv-R
Evaluated without extended thinking: reasoning questions about charts from scientific papers. Self-reported
58.8%
COLLIE
Evaluated without extended thinking: following instructions that impose constraints on generated text. Self-reported
61.0%
Tau2 airline
Evaluated without extended thinking: function-calling benchmark in the airline domain. Self-reported
45.5%
Tau2 retail
Evaluated without extended thinking: function-calling benchmark in the retail domain. Self-reported
63.4%
Tau2 telecom
Evaluated without extended thinking: function-calling benchmark in the telecom domain. Self-reported
23.5%
MMMU-Pro
Evaluated without extended thinking: a harder variant of MMMU requiring visual reasoning. Self-reported
59.9%
VideoMMMU
Evaluated without extended thinking: multimodal reasoning over video (256). Self-reported
61.2%
ERQA
Evaluated without extended thinking. Self-reported
35.2%
ComplexFuncBench
Accuracy on complex function-calling tasks. Self-reported
66.5%
EgoSchema
Evaluation on the test set. Self-reported
72.2%
Graphwalks BFS <128k
Accuracy on breadth-first-search traversal over graphs embedded in contexts of up to 128k tokens. Self-reported
41.7%
Graphwalks parents <128k
Accuracy on parent-node lookup over graphs embedded in contexts of up to 128k tokens. Self-reported
35.4%
IFEval
Accuracy. Self-reported
81.0%
Internal API instruction following (hard)
Accuracy on an internal API instruction-following evaluation (hard subset). Self-reported
29.2%
MMLU-Pro
0-shot CoT: zero-shot chain-of-thought prompting, in which the model is instructed to reason step by step before giving a final answer, without any worked examples in the prompt. Self-reported
74.7%
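The zero-shot chain-of-thought setup used for MMLU-Pro amounts to appending a step-by-step instruction to each question, with no worked examples. A minimal sketch (the exact instruction phrasing used in the evaluation is an assumption):

```python
def zero_shot_cot(question: str) -> str:
    """Wrap a question in a 0-shot chain-of-thought prompt:
    no in-context examples, just an instruction to reason step by step."""
    return f"{question}\n\nLet's solve this step by step."
```

The contrast with few-shot CoT is that no example reasoning chains are included; the instruction alone elicits intermediate reasoning before the final answer.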
MMMLU
Accuracy on MMMLU, the multilingual version of MMLU. Self-reported
81.4%
MultiChallenge (o3-mini grader)
Accuracy, graded by o3-mini. Self-reported
39.9%
Multi-IF
Accuracy on multi-turn, multilingual instruction following. Self-reported
60.9%
OpenAI-MRCR: 2 needle 128k
Accuracy. Self-reported
31.9%
SimpleQA
Accuracy. Self-reported
38.2%
SWE-Lancer
Result. Self-reported
32.6%
SWE-Lancer (IC-Diamond subset)
Score. Self-reported
12.4%
TAU-bench Airline
Accuracy of answers against reference solutions on questions requiring factual knowledge and logical reasoning: only whether the final answer is correct counts, independent of the quality of the explanation. Self-reported
42.8%
TAU-bench Retail
Accuracy on agentic tasks in a simulated retail customer-service setting. Self-reported
60.3%
Humanity's Last Exam
Evaluated without extended thinking and without tools: expert-level questions across a wide range of subjects. Self-reported
5.3%
Scale MultiChallenge
Evaluated without extended thinking: a benchmark of multi-turn instruction following. Self-reported
40.3%

License & Metadata

License
proprietary
Announcement Date
August 6, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.