Key Specifications
Parameters
-
Context
1.0M
Release Date
April 14, 2025
Average Score
34.2%
Timeline
Key dates in the model's history
Announcement
April 14, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
May 31, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
32.8K
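At the listed rates ($0.10 per 1M input tokens, $0.40 per 1M output tokens), per-request cost is simple arithmetic. A minimal sketch in Python; the token counts in the example are illustrative, not taken from this page:

```python
# Rates from the pricing table above, converted to USD per token.
INPUT_RATE = 0.10 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.40 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 50k-token prompt with a 2k-token completion.
print(f"${request_cost(50_000, 2_000):.4f}")  # $0.0058
```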
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
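Of the listed features, function calling is the one most integrations wire up first. A minimal sketch using the OpenAI Python SDK; the model ID and the get_weather tool are assumptions for illustration, since this page does not state the model's API name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1-nano",  # assumed model ID, not confirmed by this page
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```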
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard benchmark • Self-reported
Reasoning
Logical reasoning and analysis
GPQA Diamond
Standard benchmark • Self-reported
Multimodal
Working with images and visual data
MathVista
Standard benchmark • Self-reported
MMMU
Standard benchmark • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard benchmark • Self-reported
Aider-Polyglot Edit
Standard benchmark • Self-reported
AIME 2024
Standard benchmark • Self-reported
CharXiv-D
Standard benchmark • Self-reported
CharXiv-R
Standard benchmark • Self-reported
COLLIE
Standard benchmark • Self-reported
ComplexFuncBench
Standard benchmark • Self-reported
Graphwalks BFS <128k
Standard benchmark • Self-reported
Graphwalks BFS >128k
Internal benchmark • Self-reported
Graphwalks parents <128k
Internal benchmark • Self-reported
Graphwalks parents >128k
Internal benchmark • Self-reported
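The four Graphwalks entries above probe long-context graph traversal: the prompt embeds a large edge list and asks the model for, e.g., the set of nodes at a given BFS depth from a start node, or a node's parents. A minimal reference BFS in Python for comparison; the edge-list format here is illustrative, not the benchmark's exact format:

```python
from collections import defaultdict, deque

def bfs_depths(edges: list[tuple[str, str]], root: str) -> dict[str, int]:
    """Breadth-first traversal; returns each reachable node's depth from root."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

# Illustrative edge list; Graphwalks embeds a far larger one in the prompt.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(bfs_depths(edges, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```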
IFEval
Standard benchmark • Self-reported
Internal API instruction following (hard)
Internal benchmark • Self-reported
MMMLU
Standard benchmark • Self-reported
MultiChallenge
Standard benchmark (GPT-4o grader) • Self-reported
MultiChallenge (o3-mini grader)
Standard benchmark (o3-mini grader, [3]) • Self-reported
Multi-IF
Standard benchmark • Self-reported
OpenAI-MRCR: 2 needle 128k
Internal benchmark • Self-reported
OpenAI-MRCR: 2 needle 1M
Internal benchmark • Self-reported
TAU-bench Airline
Average over 5 runs without tools/prompts ([4]) • Self-reported
TAU-bench Retail
Average over 5 runs without tools/prompts ([4], model: GPT-4o) • Self-reported
License & Metadata
License
proprietary
Announcement Date
April 14, 2025
Last Updated
July 19, 2025
Similar Models
GPT-5.1 Codex Mini
OpenAI
MM
Released: Nov 2025
Price: $0.25/1M tokens
GPT-5.1 Medium
OpenAI
MM
Released: Nov 2025
Price: $1.00/1M tokens
GPT-5.1 Codex High
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.3 Codex
OpenAI
MM
Released: Feb 2026
Price: $1.75/1M tokens
GPT-5.1 Codex
OpenAI
MM
Released: Nov 2025
Price: $1.25/1M tokens
GPT-5.2 Codex
OpenAI
MM
Released: Jan 2026
Price: $1.75/1M tokens
GPT-4.1 mini
OpenAI
MM
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.40/1M tokens
GPT-5.4 Pro
OpenAI
MM
Released: Mar 2026
Price: $15.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.