
GPT-4.1

Multimodal
OpenAI

GPT-4.1 is OpenAI's latest and most advanced flagship model, significantly outperforming GPT-4 Turbo in benchmark performance, speed, and cost efficiency.

Key Specifications

Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
56.8%

Timeline

Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Today
March 26, 2026

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$2.00
Output (per 1M tokens)
$8.00
Max Input Tokens
1.0M
Max Output Tokens
32.8K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
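The pricing and token limits above can be combined into a simple pre-flight check. The sketch below is a hypothetical helper, not part of any OpenAI SDK; it assumes the listed figures ($2.00 per 1M input tokens, $8.00 per 1M output tokens) and assumes the "32.8K" output cap means 32,768 tokens.

```python
# Hypothetical cost/limit helper based on the GPT-4.1 figures listed above.
# Prices and limits are assumptions taken from this page, not an official API.
INPUT_PRICE_PER_M = 2.00      # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 8.00     # USD per 1M output tokens
MAX_INPUT_TOKENS = 1_000_000  # 1.0M context
MAX_OUTPUT_TOKENS = 32_768    # listed as 32.8K

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD, enforcing the token limits."""
    if input_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"input exceeds the {MAX_INPUT_TOKENS}-token limit")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError(f"output exceeds the {MAX_OUTPUT_TOKENS}-token limit")
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# e.g. a 100k-token prompt with a 2k-token reply:
# 100_000 * $2/1M + 2_000 * $8/1M = $0.20 + $0.016 = $0.216
```

Batch inference (listed under supported features) typically discounts these rates, so treat the result as an upper bound for batched workloads.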

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Standard benchmark. Self-reported.
90.2%

Programming

Programming skills tests
SWE-Bench Verified
Methodology per [2]. Self-reported.
54.6%

Reasoning

Logical reasoning and analysis
GPQA
Diamond subset. Self-reported.
66.3%

Multimodal

Working with images and visual data
MathVista
Standard benchmark. Self-reported.
72.2%
MMMU
Standard benchmark. Self-reported.
74.8%

Other Tests

Specialized benchmarks
Aider-Polyglot
Standard benchmark. Self-reported.
51.6%
Aider-Polyglot Edit
Standard benchmark. Self-reported.
52.9%
AIME 2024
Standard benchmark. Self-reported.
48.1%
CharXiv-D
Standard benchmark. Self-reported.
87.9%
CharXiv-R
Standard benchmark. Self-reported.
56.7%
COLLIE
Standard benchmark. Self-reported.
65.8%
ComplexFuncBench
Standard benchmark. Self-reported.
65.5%
Graphwalks BFS <128k
Standard benchmark. Self-reported.
61.7%
Graphwalks BFS >128k
Internal benchmark. Self-reported.
19.0%
Graphwalks parents <128k
Internal benchmark. Self-reported.
58.0%
Graphwalks parents >128k
Internal benchmark. Self-reported.
25.0%
IFEval
Standard benchmark. Self-reported.
87.4%
Internal API instruction following (hard)
Internal benchmark. Self-reported.
49.1%
MMMLU
Standard benchmark. Self-reported.
87.3%
MultiChallenge
Standard benchmark (GPT-4o as grader). Self-reported.
38.3%
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]). Self-reported.
46.2%
Multi-IF
Standard benchmark. Self-reported.
70.8%
OpenAI-MRCR: 2 needle 128k
Internal benchmark. Self-reported.
57.2%
OpenAI-MRCR: 2 needle 1M
Internal benchmark. Self-reported.
46.3%
TAU-bench Airline
Average of 5 runs, without special tools or prompts ([4]). Self-reported.
49.4%
TAU-bench Retail
Average of 5 runs, without special tools or prompts ([4], GPT-4o). Self-reported.
68.0%
Video-MME (long, no subtitles)
Standard benchmark. Self-reported.
72.0%
AIME 2025
GPT-4.1 without tools; competition mathematics (AIME 2025). Self-reported.
46.4%
Humanity's Last Exam
GPT-4.1 without tools; expert-level questions across diverse subjects. Self-reported.
5.4%
HMMT 2025
GPT-4.1 without tools; Harvard-MIT Mathematics Tournament. Self-reported.
28.9%

License & Metadata

License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.