Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
56.8%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Today
March 26, 2026
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.00
Output (per 1M tokens)
$8.00
Max Input Tokens
1.0M
Max Output Tokens
32.8K
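As a quick illustration of how the per-token pricing translates into request cost, here is a minimal sketch; the prices are the ones listed above, while the token counts in the example are arbitrary placeholders.

```python
# Minimal sketch: estimating per-request cost from the listed prices.
# Prices are taken from this page; the example token counts are arbitrary.

INPUT_PRICE_PER_M = 2.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 8.00  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 50k-token prompt with a 2k-token completion
print(f"${estimate_cost_usd(50_000, 2_000):.4f}")  # -> $0.1160
```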
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
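These features are exposed through the standard OpenAI Chat Completions API. The sketch below shows a minimal function-calling request; it assumes the model is served under the "gpt-4.1" identifier, and the get_weather tool is a hypothetical example rather than part of the API.

```python
# Minimal sketch of the Function Calling feature via the OpenAI Python SDK.
# Assumes the model identifier is "gpt-4.1"; get_weather is a hypothetical tool.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decided to call the tool, the call and its JSON arguments
# are returned here instead of a plain-text answer.
print(response.choices[0].message.tool_calls)
```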
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Standard benchmark (methodology: [2]) • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Internal benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs, without custom tools or prompting ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs, without custom tools or prompting ([4], GPT-4o user model) • Self-reported
Video-MME (long, no subtitles)
Standard benchmark • Self-reported
AIME 2025
GPT-4.1 without tools - mathematics (AIME 2025) • Self-reported
Humanity's Last Exam
GPT-4.1 without tools - expert-level questions across a wide range of subjects. • Self-reported
HMMT 2025
GPT-4.1 without tools - Harvard-MIT Mathematics Tournament. • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-4o
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Aug 2024
Price: $2.50/1M tokens
GPT-4o mini
OpenAI
MM
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
o3
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $2.00/1M tokens
GPT-4.5
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Feb 2025
Price: $75.00/1M tokens
GPT-5 nano
OpenAI
MM
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
GPT-4
OpenAI
MM
Best score: 1.0 (ARC)
Released: Jun 2023
Price: $30.00/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
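The page does not publish its similarity formula, so the sketch below is only an illustration of how those four characteristics could be combined into a single score; the weights and field names are assumptions, not the site's actual method.

```python
# Toy similarity score over the four listed characteristics.
# Weights and scoring are illustrative assumptions only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    developer: str
    multimodal: bool
    params_b: Optional[float]  # parameter count in billions; None if undisclosed
    avg_score: float           # average benchmark score in [0, 1]

def similarity(a: Model, b: Model) -> float:
    score = 0.0
    score += 0.35 if a.developer == b.developer else 0.0
    score += 0.25 if a.multimodal == b.multimodal else 0.0
    if a.params_b and b.params_b:  # compare sizes only when both are disclosed
        score += 0.15 * min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score += 0.25 * (1.0 - abs(a.avg_score - b.avg_score))
    return score

this_model = Model("OpenAI", True, None, 0.568)
candidate = Model("OpenAI", True, None, 0.80)
print(round(similarity(this_model, candidate), 3))  # -> 0.792
```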