GPT-4o
GPT-4o ('o' stands for 'omni') is a multimodal AI model that accepts text, audio, image, and video inputs and generates text, audio, and image outputs. It matches GPT-4 Turbo performance on text and code, with improved understanding of non-English languages, images, and audio.
Key Specifications
Parameters
-
Context
128.0K
Release Date
August 6, 2024
Average Score
52.8%
Timeline
Key dates in the model's history
Announcement
August 6, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.50
Output (per 1M tokens)
$10.00
Max Input Tokens
128.0K
Max Output Tokens
16.4K
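At the listed rates, the cost of a single call is simple arithmetic. A minimal sketch (token counts are illustrative; real counts come back in the API response):

```python
# Listed GPT-4o rates: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the listed per-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A maximal-output call: 100k input tokens plus 16k output tokens.
print(round(request_cost(100_000, 16_000), 4))  # 0.41
```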
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
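Of these features, function calling works by describing tools as JSON schemas that the model can choose to invoke. A minimal sketch of a tool definition in the shape the Chat Completions `tools` parameter expects (the `get_weather` function itself is hypothetical):

```python
import json

# Hypothetical `get_weather` tool, described as a JSON-schema object in the
# shape the Chat Completions API expects for its `tools` parameter.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The definition is plain JSON-serializable data.
print(json.dumps(weather_tool, indent=2))
```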
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Accuracy: the share of questions the model answers correctly. Accuracy can be measured on knowledge benchmarks such as TruthfulQA, and in specific domains, for example executing mathematical tasks or reproducing algorithms without logical errors. It is also important to evaluate accuracy directly on tasks requiring knowledge or reasoning, not only on questions about the models themselves. Roughly: low accuracy means the model has significant gaps in knowledge; medium accuracy means the model is usually correct but makes errors in complex cases; high accuracy means the model gives exact answers even in complex or ambiguous situations. • Self-reported
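As a concrete illustration, accuracy on a multiple-choice benchmark like MMLU is usually plain exact match over predicted answer letters. A minimal sketch with made-up answers:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(references) and references
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical answer letters for five MMLU-style questions.
preds = ["A", "C", "B", "D", "A"]
refs  = ["A", "C", "D", "D", "A"]
print(exact_match_accuracy(preds, refs))  # 0.8
```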
Programming
Programming skills tests
SWE-Bench Verified
Accuracy: the percentage of verified real-world GitHub issues the model resolves end to end.
Accuracy is perhaps the most obvious and well-known metric for measuring LLM performance. For a question-answering task or a conversation, we might measure what percentage of responses are factually correct, or what percentage of factual claims in a response are true.
Accuracy is the bedrock of trust in AI systems and the main metric in most established LLM benchmarks, such as MMLU, GPQA, BIG-Bench Hard, and GSM8K.
When we talk about accuracy, it is worth examining both absolute accuracy (correctness relative to the ground truth) and comparative accuracy (how the model's knowledge compares to other models).
An advanced model does not need to be correct 100% of the time, but when it is wrong, we expect it to be wrong in the right ways: perhaps due to inherent ambiguity in the question, the subjective nature of the domain, or a lack of perfect ground truth. We particularly want to avoid confident but incorrect responses (hallucinations). • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
GPT-4o, GPQA Diamond, without thinking mode and without tools. GPT-4o reaches this level on GPQA Diamond without using a thinking mode or external tools, indicating noticeably stronger understanding and reasoning than earlier models such as GPT-3.5, which need tools or additional prompting approaches to achieve similar results. • Self-reported
Multimodal
Working with images and visual data
AI2D
Evaluation on test set • Self-reported
ChartQA
Evaluation on test set • Self-reported
DocVQA
Evaluation on test set • Self-reported
MathVista
Accuracy
MathVista evaluates mathematical reasoning grounded in visual contexts such as charts, plots, and geometry diagrams; the score is the fraction of questions answered correctly. • Self-reported
MMMU
GPT-4o without thinking mode: solving expert-level visual tasks. • Self-reported
Other Tests
Specialized benchmarks
ActivityNet
Evaluation on test set • Self-reported
Aider-Polyglot
Accuracy • Self-reported
Aider-Polyglot Edit
Accuracy • Self-reported
AIME 2024
Accuracy • Self-reported
CharXiv-D
Accuracy: descriptive questions about scientific charts. • Self-reported
CharXiv-R
GPT-4o without thinking mode: reasoning questions about charts that require justification. • Self-reported
COLLIE
GPT-4o without thinking mode: following constraints on generated text. • Self-reported
Tau2 airline
GPT-4o without thinking mode: agentic benchmark with function calls (airline domain). • Self-reported
Tau2 retail
GPT-4o without thinking mode: agentic benchmark with function calls (retail domain). • Self-reported
Tau2 telecom
GPT-4o without thinking mode: agentic benchmark with function calls (telecom domain). • Self-reported
MMMU-Pro
GPT-4o without thinking mode: solving expert-level visual tasks with reasoning. • Self-reported
VideoMMMU
GPT-4o without thinking mode: multimodal reasoning over video (256 frames). • Self-reported
ERQA
GPT-4o without thinking mode. • Self-reported
ComplexFuncBench
Accuracy: the fraction of complex, multi-step function-calling tasks completed with exactly correct calls and final answers. • Self-reported
EgoSchema
Evaluation on test set • Self-reported
Graphwalks BFS <128k
Accuracy
Graphwalks asks the model to traverse a graph given in the context, here with breadth-first search, over contexts up to 128k tokens; accuracy is the fraction of traversals answered correctly. • Self-reported
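The ground truth for a Graphwalks-style BFS task can be computed with an ordinary breadth-first search. A minimal sketch over a made-up graph:

```python
from collections import deque

def bfs_within_hops(graph, start, max_hops):
    """Return the set of nodes reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Made-up adjacency list; "e" is 3 hops away, so it is excluded at max_hops=2.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(sorted(bfs_within_hops(graph, "a", 2)))  # ['a', 'b', 'c', 'd']
```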
Graphwalks parents <128k
Accuracy
Graphwalks asks the model to identify the parent nodes of a given node in a graph provided in the context, over contexts up to 128k tokens. • Self-reported
IFEval
Accuracy • Self-reported
Internal API instruction following (hard)
Accuracy • Self-reported
MMLU-Pro
0-shot CoT. Zero-shot Chain-of-Thought (0-shot CoT) is a prompting technique that asks the LLM to reason step by step through a complex task without providing worked examples of such reasoning. To use it, simply add a step-by-step instruction (for example, "Let's solve this step by step") to the prompt. This lets the model break a complex task into manageable steps instead of trying to produce the final answer immediately, which improves performance on reasoning-heavy tasks, especially mathematical and logical ones. • Self-reported
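A zero-shot CoT prompt is just the question plus a trigger phrase, with no worked examples. A minimal sketch (the exact trigger wording varies between papers and evaluations):

```python
# Assumed trigger phrase; evaluations use variants such as
# "Let's think step by step."
COT_TRIGGER = "Let's solve this step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Append a zero-shot chain-of-thought trigger to a question.
    No worked examples are included: that is what makes it zero-shot."""
    return f"{question}\n{COT_TRIGGER}"

print(zero_shot_cot_prompt(
    "If a train travels 60 km in 45 minutes, what is its average speed?"))
```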
MMMLU
Accuracy. MMMLU is a multilingual version of MMLU with questions professionally translated into multiple languages; the score is exact-match accuracy on the multiple-choice answers. • Self-reported
MultiChallenge (o3-mini grader)
Accuracy, graded by o3-mini. • Self-reported
Multi-IF
Accuracy
Multi-IF evaluates multi-turn, multilingual instruction following; the score is the fraction of instructions satisfied. • Self-reported
OpenAI-MRCR: 2 needle 128k
Accuracy: multi-round co-reference resolution, retrieving 2 "needles" from contexts up to 128k tokens. • Self-reported
SimpleQA
Accuracy • Self-reported
SWE-Lancer
Result • Self-reported
SWE-Lancer (IC-Diamond subset)
Score • Self-reported
TAU-bench Airline
Accuracy. We evaluate the model's answers to benchmark questions that require factual knowledge and logical reasoning by comparing them with reference solutions: how reliably the model selects the correct option, or answers free-form questions. The evaluation considers only whether the final answer is correct, independent of the quality of the explanation, so accuracy reflects the model's knowledge rather than its fluency. • Self-reported
TAU-bench Retail
Accuracy • Self-reported
Humanity's Last Exam
GPT-4o without thinking mode (without tools): a set of expert-level questions across many subjects. • Self-reported
Scale MultiChallenge
GPT-4o without thinking mode: benchmark for instruction following. • Self-reported
License & Metadata
License
proprietary
Announcement Date
August 6, 2024
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-4.1
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4o mini
OpenAI
MM
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
o3
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4.5
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Feb 2025
Price: $75.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
GPT-4
OpenAI
MM
Best score: 1.0 (ARC)
Released: Jun 2023
Price: $30.00/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.