
GPT-4.1 nano

Multimodal
OpenAI

GPT-4.1 nano is the fastest and most affordable model in OpenAI's GPT-4.1 family. It delivers exceptional performance in a compact size with a 1 million token context window. Ideal for classification and autocomplete tasks.

Key Specifications

Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
34.2%

Timeline

Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
32.8K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
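Given the listed rates ($0.10 per 1M input tokens, $0.40 per 1M output tokens), the cost of a single request is straightforward to estimate. A minimal sketch, assuming the rates above and a hypothetical helper name:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 0.10, output_price: float = 0.40) -> float:
    """Estimate the USD cost of one request at per-1M-token rates.

    input_price / output_price are the listed GPT-4.1 nano rates
    (USD per 1,000,000 tokens).
    """
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 10,000-token prompt with a 1,000-token completion
# costs (10_000 * 0.10 + 1_000 * 0.40) / 1e6 = $0.0014
print(f"${request_cost(10_000, 1_000):.4f}")
```

At these rates, even a full 1M-token input context costs about $0.10, which is what makes the model practical for high-volume classification and autocomplete workloads.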

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Standard benchmark. Self-reported.
80.1%

Reasoning

Logical reasoning and analysis
GPQA
Diamond subset. Self-reported.
50.3%

Multimodal

Working with images and visual data
MathVista
Standard benchmark. Self-reported.
56.2%
MMMU
Standard benchmark. Self-reported.
55.4%

Other Tests

Specialized benchmarks
Aider-Polyglot
Standard benchmark. Self-reported.
9.8%
Aider-Polyglot Edit
Standard benchmark. Self-reported.
6.2%
AIME 2024
Standard benchmark. Self-reported.
29.4%
CharXiv-D
Standard benchmark. Self-reported.
73.9%
CharXiv-R
Standard benchmark. Self-reported.
40.5%
COLLIE
Standard benchmark. Self-reported.
42.5%
ComplexFuncBench
Standard benchmark. Self-reported.
5.7%
Graphwalks BFS <128k
Standard benchmark. Self-reported.
25.0%
Graphwalks BFS >128k
Internal benchmark. Self-reported.
2.9%
Graphwalks parents <128k
Internal benchmark. Self-reported.
9.4%
Graphwalks parents >128k
Internal benchmark. Self-reported.
5.6%
IFEval
Standard benchmark. Self-reported.
74.5%
Internal API instruction following (hard)
Internal benchmark. Self-reported.
31.6%
MMMLU
Standard benchmark. Self-reported.
66.9%
MultiChallenge
Standard benchmark (GPT-4o grader). Self-reported.
15.0%
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]). Self-reported.
31.1%
Multi-IF
Standard benchmark. Self-reported.
57.2%
OpenAI-MRCR: 2 needle 128k
Internal benchmark. Self-reported.
36.6%
OpenAI-MRCR: 2 needle 1M
Internal benchmark. Self-reported.
12.0%
TAU-bench Airline
Average over 5 runs, without tools/prompts ([4]). Self-reported.
14.0%
TAU-bench Retail
Average over 5 runs, without tools/prompts ([4], GPT-4o). Self-reported.
22.6%

License & Metadata

License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.