
GPT-4

Multimodal
OpenAI

GPT-4 is a large multimodal model capable of processing image and text inputs and generating human-like text outputs. It demonstrates human-level performance across various professional and academic benchmarks.

Key Specifications

Parameters
-
Context
32.8K
Release Date
June 13, 2023
Average Score
77.7%
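The average score appears to be the unweighted mean of the twelve self-reported benchmark results listed further down this page. A quick sanity check in Python, with the names and values copied from the benchmark tables below:

```python
# Unweighted mean of the twelve self-reported benchmark scores on this page
# (assumption: this is how the 77.7% "Average Score" is derived).
scores = {
    "HellaSwag": 95.3, "MMLU": 86.4, "Winogrande": 87.5,
    "HumanEval": 67.0, "MATH": 42.0, "MGSM": 74.5,
    "DROP": 80.9, "GPQA": 35.7, "ARC": 96.3,
    "LSAT": 88.0, "SAT Math": 89.0, "Uniform Bar Exam": 90.0,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # -> 77.7%
```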

Timeline

Key dates in the model's history
Announcement
June 13, 2023
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
December 31, 2022
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$30.00
Output (per 1M tokens)
$60.00
Max Input Tokens
32.8K
Max Output Tokens
32.8K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
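A minimal sketch of how the listed pricing and context limits translate into a per-request budget check. The helper below is hypothetical (not part of any OpenAI SDK); it assumes tiktoken's gpt-4 encoding approximates the model's actual tokenizer and treats the 32.8K limit as a single shared input/output window of 32,768 tokens:

```python
# Hypothetical cost estimator using the rates listed above.
import tiktoken

INPUT_USD_PER_M = 30.00    # input price per 1M tokens (table above)
OUTPUT_USD_PER_M = 60.00   # output price per 1M tokens (table above)
CONTEXT_LIMIT = 32_768     # "32.8K" rounded; assumed shared input/output window

def estimate_cost(prompt: str, max_output_tokens: int) -> float:
    enc = tiktoken.encoding_for_model("gpt-4")
    n_input = len(enc.encode(prompt))
    if n_input + max_output_tokens > CONTEXT_LIMIT:
        raise ValueError(f"{n_input} input + {max_output_tokens} output "
                         f"tokens exceeds the {CONTEXT_LIMIT}-token window")
    return (n_input * INPUT_USD_PER_M
            + max_output_tokens * OUTPUT_USD_PER_M) / 1_000_000

print(f"${estimate_cost('Summarize the attached report.', 1024):.4f}")
```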

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot, commonsense reasoning about everyday situations: the model must pick the most plausible continuation of a described scenario. Self-reported.
95.3%
MMLU
5-shot, multiple-choice questions spanning 57 academic and professional subjects (few-shot prompt layout sketched below). Self-reported.
86.4%
Winogrande
5-shot, commonsense pronoun resolution: the model must identify the correct referent of an ambiguous pronoun from context. Self-reported.
87.5%
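The n-shot settings above (10-shot HellaSwag, 5-shot MMLU and Winogrande) mean that n solved examples are prepended to each test question. A rough illustration of that prompt layout; the exact formatting behind these self-reported scores is not published here, so the helper below is hypothetical:

```python
# Hypothetical n-shot prompt builder: n solved Q/A pairs followed by
# the test question, left open for the model to complete.
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

demo = [("What is the capital of France?", "Paris"),
        ("What gas do plants absorb?", "Carbon dioxide")]
print(build_few_shot_prompt(demo, "What is the largest planet?"))
```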

Programming

Programming skills tests
HumanEval
0-shot, Python programming tasks scored by functional correctness: without seeing any example solutions, the model must generate code that passes the task's unit tests. The tasks are deliberately simple (e.g., writing a function to compute a sum, test a property, or search a list) and serve as a quick check of basic coding ability; more complex problems involving data structures or algorithms are covered by other benchmarks. Self-reported.
67.0%
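HumanEval counts a completion as correct only if the generated function satisfies the task's unit tests. A bare-bones sketch of that check; the real harness sandboxes execution, so the plain `exec` here is for illustration only:

```python
# Simplified HumanEval-style pass check: run the generated code, then the
# benchmark's assertions, in a shared namespace. NOT sandboxed.
def passes_tests(candidate_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the task's assertions
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(candidate, tests))  # True -> counts toward pass@1
```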

Mathematics

Mathematical problems and computations
MATH
Competition-level mathematics problems requiring multi-step reasoning. Chain-of-thought prompting substantially improves performance relative to direct answering, though results remain below expert human level, and the model can accumulate errors over long calculations. Self-reported.
42.0%
MGSM
Multilingual grade-school math word problems, typically solved with step-by-step (chain-of-thought) reasoning and scored on the final answer. Self-reported.
74.5%
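For benchmarks like MGSM, the model reasons step by step and only the final extracted answer is compared against the reference. A hypothetical extraction heuristic, not the official grader:

```python
# Illustrative final-answer extraction for chain-of-thought math responses:
# take the last number mentioned and compare it to the reference answer.
import re

def extract_final_number(response: str) -> float | None:
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

response = ("Each of the 4 boxes holds 12 apples, so 4 * 12 = 48. "
            "The answer is 48.")
print(extract_final_number(response) == 48.0)  # True
```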

Reasoning

Logical reasoning and analysis
DROP
3-shot, reading comprehension requiring discrete reasoning and arithmetic over paragraphs, scored by token-level F1 (a simplified scorer is sketched below). Self-reported.
80.9%
GPQA
5-shot, graduate-level "Google-proof" questions in biology, physics, and chemistry that require expert domain knowledge rather than retrievable facts. Self-reported.
35.7%
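DROP is reported as an F1 score: partial credit for token overlap between the predicted and gold answers rather than exact match. A simplified version of that metric; the official scorer additionally normalizes punctuation and handles multi-span and numeric answers:

```python
# Simplified DROP-style token F1: precision and recall over the bag of
# whitespace-separated answer tokens.
from collections import Counter

def drop_style_f1(prediction: str, gold: str) -> float:
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(drop_style_f1("48 yards", "48"))  # ~0.667: partial credit
```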

Other Tests

Specialized benchmarks
AI2 Reasoning Challenge (ARC)
25-shot, grade-school science questions with multiple-choice answers (Challenge Set). Self-reported.
96.3%
LSAT
Percentile score on the Law School Admission Test. Self-reported.
88.0%
SAT Math
Score of 710 out of 800 on the SAT Math section, expressed here as a percentile. Self-reported.
89.0%
Uniform Bar Exam
Percentile score on the Uniform Bar Exam. Self-reported.
90.0%

License & Metadata

License
proprietary
Announcement Date
June 13, 2023
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.