
GPT-4

Multimodal
OpenAI

GPT-4 is a large multimodal model capable of processing image and text inputs and generating human-like text outputs. It demonstrates human-level performance across various professional and academic benchmarks.

Key Specifications

Parameters
-
Context
32.8K
Release Date
June 13, 2023
Average Score
77.7%
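The average score appears to be the unweighted mean of the twelve self-reported benchmark results listed further down this page. A quick sanity check in Python, with the names and values copied from the benchmark tables below:

```python
# Unweighted mean of the twelve self-reported benchmark scores on this page
# (assumption: this is how the 77.7% "Average Score" is derived).
scores = {
    "HellaSwag": 95.3, "MMLU": 86.4, "Winogrande": 87.5,
    "HumanEval": 67.0, "MATH": 42.0, "MGSM": 74.5,
    "DROP": 80.9, "GPQA": 35.7, "ARC": 96.3,
    "LSAT": 88.0, "SAT Math": 89.0, "Uniform Bar Exam": 90.0,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # -> 77.7%
```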

Timeline

Key dates in the model's history
Announcement
June 13, 2023
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
December 31, 2022
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$30.00
Output (per 1M tokens)
$60.00
Max Input Tokens
32.8K
Max Output Tokens
32.8K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
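A minimal sketch of how the listed pricing and context limits translate into a per-request budget check. The helper below is hypothetical (not part of any OpenAI SDK); it assumes tiktoken's gpt-4 encoding approximates the model's actual tokenizer and treats the 32.8K limit as a single shared input/output window of 32,768 tokens:

```python
# Hypothetical cost estimator using the rates listed above.
import tiktoken

INPUT_USD_PER_M = 30.00    # input price per 1M tokens (table above)
OUTPUT_USD_PER_M = 60.00   # output price per 1M tokens (table above)
CONTEXT_LIMIT = 32_768     # "32.8K" rounded; assumed shared input/output window

def estimate_cost(prompt: str, max_output_tokens: int) -> float:
    enc = tiktoken.encoding_for_model("gpt-4")
    n_input = len(enc.encode(prompt))
    if n_input + max_output_tokens > CONTEXT_LIMIT:
        raise ValueError(f"{n_input} input + {max_output_tokens} output "
                         f"tokens exceeds the {CONTEXT_LIMIT}-token window")
    return (n_input * INPUT_USD_PER_M
            + max_output_tokens * OUTPUT_USD_PER_M) / 1_000_000

print(f"${estimate_cost('Summarize the attached report.', 1024):.4f}")
```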

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot, commonsense reasoning about everyday situations: the model must pick the most plausible continuation of a described scenario. Self-reported.
95.3%
MMLU
5-shot, multiple-choice questions spanning 57 academic and professional subjects (few-shot prompt layout sketched below). Self-reported.
86.4%
Winogrande
5-shot, commonsense pronoun resolution: the model must identify the correct referent of an ambiguous pronoun from context. Self-reported.
87.5%
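The n-shot settings above (10-shot HellaSwag, 5-shot MMLU and Winogrande) mean that n solved examples are prepended to each test question. A rough illustration of that prompt layout; the exact formatting behind these self-reported scores is not published here, so the helper below is hypothetical:

```python
# Hypothetical n-shot prompt builder: n solved Q/A pairs followed by
# the test question, left open for the model to complete.
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

demo = [("What is the capital of France?", "Paris"),
        ("What gas do plants absorb?", "Carbon dioxide")]
print(build_few_shot_prompt(demo, "What is the largest planet?"))
```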

Programming

Programming skills tests
HumanEval
0-shot, Python programming tasks scored by functional correctness: without seeing any example solutions, the model must generate code that passes the task's unit tests. The tasks are deliberately simple (e.g., writing a function to compute a sum, test a property, or search a list) and serve as a quick check of basic coding ability; more complex problems involving data structures or algorithms are covered by other benchmarks. Self-reported.
67.0%
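HumanEval counts a completion as correct only if the generated function satisfies the task's unit tests. A bare-bones sketch of that check; the real harness sandboxes execution, so the plain `exec` here is for illustration only:

```python
# Simplified HumanEval-style pass check: run the generated code, then the
# benchmark's assertions, in a shared namespace. NOT sandboxed.
def passes_tests(candidate_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the task's assertions
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(candidate, tests))  # True -> counts toward pass@1
```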

Mathematics

Mathematical problems and computations
MATH
Competition-level mathematics problems requiring multi-step reasoning. Chain-of-thought prompting substantially improves performance relative to direct answering, though results remain below expert human level, and the model can accumulate errors over long calculations. Self-reported.
42.0%
MGSM
Multilingual grade-school math word problems, typically solved with step-by-step (chain-of-thought) reasoning and scored on the final answer. Self-reported.
74.5%
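For benchmarks like MGSM, the model reasons step by step and only the final extracted answer is compared against the reference. A hypothetical extraction heuristic, not the official grader:

```python
# Illustrative final-answer extraction for chain-of-thought math responses:
# take the last number mentioned and compare it to the reference answer.
import re

def extract_final_number(response: str) -> float | None:
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

response = ("Each of the 4 boxes holds 12 apples, so 4 * 12 = 48. "
            "The answer is 48.")
print(extract_final_number(response) == 48.0)  # True
```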

Reasoning

Logical reasoning and analysis
DROP
3-shot, reading comprehension requiring discrete reasoning and arithmetic over paragraphs, scored by token-level F1 (a simplified scorer is sketched below). Self-reported.
80.9%
GPQA
5-shot, graduate-level "Google-proof" questions in biology, physics, and chemistry that require expert domain knowledge rather than retrievable facts. Self-reported.
35.7%
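DROP is reported as an F1 score: partial credit for token overlap between the predicted and gold answers rather than exact match. A simplified version of that metric; the official scorer additionally normalizes punctuation and handles multi-span and numeric answers:

```python
# Simplified DROP-style token F1: precision and recall over the bag of
# whitespace-separated answer tokens.
from collections import Counter

def drop_style_f1(prediction: str, gold: str) -> float:
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(drop_style_f1("48 yards", "48"))  # ~0.667: partial credit
```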

Other Tests

Specialized benchmarks
AI2 Reasoning Challenge (ARC)
25-shot, grade-school science questions with multiple-choice answers (Challenge Set). Self-reported.
96.3%
LSAT
Percentile score on the Law School Admission Test. Self-reported.
88.0%
SAT Math
Score of 710 out of 800 on the SAT Math section, expressed here as a percentile. Self-reported.
89.0%
Uniform Bar Exam
Percentile score on the Uniform Bar Exam. Self-reported.
90.0%

License & Metadata

License
proprietary
Announcement Date
June 13, 2023
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.