Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
49.6%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$1.60
Max Input Tokens
1.0M
Max Output Tokens
32.8K
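Cost at the listed rates scales linearly with token counts. The short Python sketch below illustrates the arithmetic; the request sizes in the example are hypothetical, not measured usage.

# Cost estimate at the listed rates: $0.40 per 1M input tokens, $1.60 per 1M output tokens.
INPUT_RATE_PER_M = 0.40
OUTPUT_RATE_PER_M = 1.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Return the USD cost of a single request at the listed per-1M-token rates.
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

print(f"${request_cost(10_000, 2_000):.4f}")  # 10k input + 2k output tokens -> $0.0072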
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
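For orientation, the sketch below shows how the function-calling feature is typically exercised through the OpenAI Python SDK (openai>=1.0). It is a minimal illustration under stated assumptions, not the vendor's documented example: the model identifier "gpt-4.1-mini" is inferred from the benchmark labels on this page, and the get_weather tool is hypothetical.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition; the model may choose to call it with JSON arguments.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # assumed API identifier for this model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model elected to call the tool, its name and JSON arguments are returned here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)

The other listed features are exposed through additional request parameters or separate endpoints of the same SDK.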
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
Methodology per [2] • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond subset, standard benchmark • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Standard benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs without special tools or prompts ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs without special tools or prompts ([4], GPT-4o) • Self-reported
AIME 2025
GPT-4.1 mini without tools - mathematics (AIME 2025) • Self-reported
Humanity's Last Exam
GPT-4.1 mini without tools - Expert-level questions across various subjects.
HMMT 2025
GPT-4.1 mini without tools - Harvard-MIT Mathematics Tournament. • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
o4-mini
OpenAI
MM
Best score: 0.8 (GPQA)
Released: Apr 2025
Price: $1.10/1M tokens
GPT-5.3 Codex
OpenAI
MM
Released: Feb 2026
Price: $1.75/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
GPT-5.1 Codex Mini
OpenAI
MM
Released: Nov 2025
Price: $0.25/1M tokens
GPT-4.1 nano
OpenAI
MM
Best score: 0.8 (MMLU)
Released: Apr 2025
Price: $0.10/1M tokens
GPT-5.4 Pro
OpenAI
MM
Released: Mar 2026
Price: $15.00/1M tokens
o3-pro
OpenAI
MM
Released: Jun 2025
Price: $20.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
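As a rough illustration of the similarity idea described above, a naive scoring function over the listed characteristics might look like the sketch below. This is not the catalog's actual recommendation algorithm, and the example values are hypothetical.

# Naive similarity sketch over the criteria named above; illustrative only.
def similarity(a: dict, b: dict) -> float:
    score = 0.0
    score += 1.0 if a["developer"] == b["developer"] else 0.0    # same organization
    score += 1.0 if a["multimodal"] == b["multimodal"] else 0.0  # multimodality match
    score += 1.0 - abs(a["benchmark"] - b["benchmark"])          # closer scores (0..1) rank higher
    # Parameter size omitted here because it is not listed for these models.
    return score

model_a = {"developer": "OpenAI", "multimodal": True, "benchmark": 0.50}  # hypothetical values
model_b = {"developer": "OpenAI", "multimodal": True, "benchmark": 0.75}  # hypothetical values
print(similarity(model_a, model_b))  # 2.75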