
GPT-4.1 mini

Multimodal
OpenAI

GPT-4.1 mini balances intelligence, speed, and cost. It represents a significant step forward in small-model performance, matching or even surpassing GPT-4o on many benchmarks while reducing latency and cost.

Key Specifications

Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
49.6%

Timeline

Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$1.60
Max Input Tokens
1.0M
Max Output Tokens
32.8K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
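
A quick way to sanity-check the listed rates is to compute per-request cost directly. The sketch below is a minimal example; the helper name and token counts are illustrative, not from the source.

# Cost estimate at the listed GPT-4.1 mini rates:
# $0.40 per 1M input tokens, $1.60 per 1M output tokens.
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 1.60

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 50,000-token prompt with a 2,000-token reply
print(f"${estimate_cost(50_000, 2_000):.4f}")  # -> $0.0232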
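
For the function-calling feature, a minimal sketch with the OpenAI Python SDK might look like the following. It assumes the public API identifier is "gpt-4.1-mini"; the get_weather tool is hypothetical.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical tool the model may choose to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # assumed API name for GPT-4.1 mini
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    max_tokens=1_000,  # must stay within the 32.8K output-token cap
)
print(response.choices[0].message.tool_calls)

Structured output, batch inference, and fine-tuning use the same client with different request options; consult OpenAI's API reference for the exact parameters.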

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Standard benchmark covering multitask knowledge and reasoning across 57 subjects. Self-reported.
87.5%

Programming

Programming skills tests
SWE-Bench Verified
Human-verified subset of SWE-Bench measuring resolution of real GitHub issues; methodology per [2]. Self-reported.
23.6%

Reasoning

Logical reasoning and analysis
GPQA
Diamond subset: graduate-level science questions designed to be difficult even with web access. Self-reported.
65.0%

Multimodal

Working with images and visual data
MathVista
Standard benchmark for mathematical reasoning over visual inputs such as charts and diagrams. Self-reported.
73.1%
MMMU
Standard benchmark of college-level multimodal understanding and reasoning across disciplines. Self-reported.
72.7%

Other Tests

Specialized benchmarks
Aider-Polyglot
Multi-language code-editing benchmark from the Aider project. Self-reported.
34.7%
Aider-Polyglot Edit
Edit-format variant of Aider-Polyglot. Self-reported.
31.6%
AIME 2024
American Invitational Mathematics Examination, 2024 problem set. Self-reported.
49.6%
CharXiv-D
Descriptive-question split of the CharXiv chart-understanding benchmark. Self-reported.
88.4%
CharXiv-R
Reasoning-question split of the CharXiv chart-understanding benchmark. Self-reported.
56.8%
COLLIE
Standard benchmark for text generation under compositional constraints. Self-reported.
54.6%
ComplexFuncBench
Standard benchmark for complex, multi-step function calling. Self-reported.
49.3%
Graphwalks BFS <128k
Standard long-context benchmark: breadth-first traversal of a graph described in the prompt, at contexts under 128k tokens. Self-reported.
61.7%
Graphwalks BFS >128k
Internal benchmark: the same breadth-first traversal task at contexts beyond 128k tokens. Self-reported.
15.0%
Graphwalks parents <128k
Internal benchmark: parent-finding variant of Graphwalks at contexts under 128k tokens. Self-reported.
60.5%
Graphwalks parents >128k
Internal benchmark: parent-finding variant of Graphwalks at contexts beyond 128k tokens. Self-reported.
11.0%
IFEval
Standard benchmark for verifiable instruction following. Self-reported.
84.1%
Internal API instruction following (hard)
OpenAI internal benchmark for following hard API-style instructions. Self-reported.
45.1%
MMMLU
Multilingual version of MMLU with human-translated questions. Self-reported.
78.5%
MultiChallenge
Multi-turn conversation benchmark, scored by a GPT-4o grader. Self-reported.
35.8%
MultiChallenge (o3-mini grader)
Same benchmark scored by an o3-mini grader ([3]). Self-reported.
42.2%
Multi-IF
Multilingual, multi-turn instruction-following benchmark. Self-reported.
67.0%
OpenAI-MRCR: 2 needle 128k
OpenAI internal long-context benchmark: multi-round co-reference retrieval with 2 needles at 128k context. Self-reported.
47.2%
OpenAI-MRCR: 2 needle 1M
Internal benchmark: the same retrieval task at 1M-token context. Self-reported.
33.3%
TAU-bench Airline
Agentic tool-use benchmark, airline domain; average of 5 runs without special tools or prompting ([4]). Self-reported.
36.0%
TAU-bench Retail
Retail domain; average of 5 runs without special tools or prompting (note [4]; GPT-4o as user model). Self-reported.
55.8%
AIME 2025
GPT-4.1 mini without tools; competition mathematics (AIME 2025). Self-reported.
40.2%
Humanity's Last Exam
GPT-4.1 mini without tools; expert-level questions across many subjects. Self-reported.
3.7%
HMMT 2025
GPT-4.1 mini without tools; Harvard-MIT Mathematics Tournament problems. Self-reported.
35.0%

License & Metadata

License
Proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.