
Gemini 2.5 Pro

Multimodal
Google

Our most intelligent AI model, built for the agentic era. Gemini 2.5 Pro leads on widely used benchmarks, with improved reasoning capabilities, multimodal abilities (text, image, video, and audio input), and a 1 million token context window.

Key Specifications

Parameters
-
Context
1.0M
Release Date
May 20, 2025
Average Score
69.6%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$1.25
Output (per 1M tokens)
$10.00
Max Input Tokens
1.0M
Max Output Tokens
65.5K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
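
For illustration, here is a minimal sketch of how a per-request cost works out from the prices listed above. The constants mirror the Pricing & Availability table; the function name and example numbers are hypothetical, not part of any official SDK.

```python
# Minimal cost sketch based on the prices listed above ($1.25 / $10.00 per 1M tokens).
# The function name and example values are illustrative only.

INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 200,000-token prompt with an 8,000-token response
print(f"${estimate_cost(200_000, 8_000):.2f}")  # -> $0.33
```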

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
SWE-Bench Verified
Accuracy (Self-reported)
63.2%

Reasoning

Logical reasoning and analysis
GPQA
Pass@1 (Self-reported). Measures how often the model correctly solves a task on its first attempt: the number of tasks answered correctly on the first try divided by the total number of tasks. A high Pass@1 score indicates that the model is reliable in use, which is especially important in scenarios where there is no way to verify the result or make several attempts. A short computation sketch follows this entry.
83.0%
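
As a rough illustration of the single-attempt Pass@1 computation described above, here is a minimal sketch; the task set and the model-answer callback are hypothetical placeholders, not part of any real benchmark harness.

```python
# Minimal sketch of single-attempt Pass@1: each task is tried once and the answer
# is compared against the reference. `tasks` and `model_answer` are hypothetical.

def pass_at_1(tasks, model_answer) -> float:
    """Fraction of tasks answered correctly on the first (and only) attempt."""
    correct = sum(1 for t in tasks if model_answer(t["question"]) == t["reference"])
    return correct / len(tasks)

# Toy example: a trivial "model" that evaluates arithmetic expressions
toy_tasks = [
    {"question": "2+2", "reference": "4"},
    {"question": "3*3", "reference": "9"},
    {"question": "10/4", "reference": "2.5"},
]
print(pass_at_1(toy_tasks, lambda q: str(eval(q))))  # -> 1.0
```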

Multimodal

Working with images and visual data
MMMU
Pass@1 (Self-reported). Measures the probability that the model solves a task on its first attempt: each task is run once and the solution is checked for correctness. Pass@1 is commonly used to evaluate mathematical and coding ability; in mathematical tasks an answer counts as correct if it matches the reference answer, and in programming tasks a solution counts as correct if it passes all test cases.
79.6%

Other Tests

Specialized benchmarks
Aider-Polyglot
Self-reported
76.5%
Aider-Polyglot Edit
Diff (Self-reported)
72.7%
AIME 2024
Pass@1 (Self-reported). Evaluates the model's ability to solve tasks on the first attempt. To compute Pass@1: 1. The model makes N attempts at each task. 2. For each task, the probability of success on the first attempt is estimated as the number of correct solutions divided by N. 3. Averaging this probability over all tasks gives the Pass@1 score. For example, if the model solves a task in 60 out of 100 attempts, its Pass@1 for that task is 0.6, or 60%. A short computation sketch follows this entry.
92.0%
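
The N-attempt averaging described above can be sketched as follows; the results structure is a hypothetical example, not actual benchmark output.

```python
# Minimal sketch of the averaged Pass@1 estimate: per-task success rate
# (correct attempts / N) averaged over all tasks. `results` is hypothetical.

def pass_at_1_from_attempts(results) -> float:
    """results: list of (num_correct, num_attempts) pairs, one per task."""
    per_task = [correct / attempts for correct, attempts in results]
    return sum(per_task) / len(per_task)

# The worked example from the description: 60 correct out of 100 attempts -> 0.6
print(pass_at_1_from_attempts([(60, 100)]))             # -> 0.6
print(pass_at_1_from_attempts([(60, 100), (90, 100)]))  # -> 0.75
```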
AIME 2025
Pass@1 (Self-reported). Evaluates how often the model produces the correct answer on its first attempt, a standard metric for many tasks such as writing code or solving mathematical problems. Each task is run once and the answer is checked for correctness, so the score measures the model's ability to produce a result without repeated attempts. A high Pass@1 indicates that the model can generate correct answers on the first try, which makes it more dependable for users who cannot verify or compare multiple answers.
83.0%
ARC-AGI v2
Accuracy (Verified)
4.9%
Global-MMLU-Lite
Accuracy (Self-reported)
88.6%
Humanity's Last Exam
Accuracy (Self-reported). AI-generated solutions may contain various forms of inaccuracies, including: 1) factual errors: incorrect statements presented as facts; 2) mathematical errors: incorrect calculations or mathematical steps; 3) logical errors: flawed reasoning in problem-solving; 4) hallucinations: generation of non-existent information; 5) definition errors: misunderstanding or misuse of technical terms. Accuracy can be assessed on a 5-point scale: 1 = completely incorrect solution with fundamental misunderstandings; 2 = mostly incorrect with some valid elements; 3 = partially correct with significant errors; 4 = mostly correct with minor errors; 5 = completely correct solution with no errors.
17.8%
LiveCodeBench v5
Pass@1 (Self-reported). Determines how many tasks the model solves on its first attempt, i.e., the probability of generating a correct answer right away. This is especially important in contexts where the user relies on the model to produce a correct answer on the first try, without verification or corrections. Method: 1. The model is given a set of tasks. 2. It makes one attempt at each task. 3. The score is the percentage of tasks solved correctly on that first attempt. Limitations: the metric does not account for how close the model came to a correct answer, and partially correct answers are not reflected in the binary Pass@1 score.
75.6%
MRCR
128k average (Self-reported). Measures the model's ability to retrieve and use information from a very long context: questions are asked about content placed within a 128,000-token context, and accuracy is averaged across questions. This capability matters for many real-world long-context tasks.
93.0%
MRCR 1M (pointwise)
Self-reported
82.9%
SimpleQA
Accuracy (Self-reported)
50.8%
Vibe-Eval
Accuracy (Self-reported)
65.6%
Video-MME
Accuracy (Self-reported)
84.8%

License & Metadata

License
proprietary
Announcement Date
May 20, 2025
Last Updated
July 19, 2025
