
Gemini 2.5 Pro

Multimodal
Google

Our most intelligent AI model, built for the agentic era. Gemini 2.5 Pro leads on widely used benchmarks, with improved reasoning capabilities, multimodal abilities (text, image, video, and audio input), and a 1 million token context window.

Key Specifications

Parameters
-
Context
1.0M
Release Date
May 20, 2025
Average Score
69.6%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$1.25
Output (per 1M tokens)
$10.00
Max Input Tokens
1.0M
Max Output Tokens
65.5K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
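
For illustration, here is a minimal sketch of how a per-request cost works out from the prices listed above. The constants mirror the Pricing & Availability table; the function name and example numbers are hypothetical, not part of any official SDK.

```python
# Minimal cost sketch based on the prices listed above ($1.25 / $10.00 per 1M tokens).
# The function name and example values are illustrative only.

INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 200,000-token prompt with an 8,000-token response
print(f"${estimate_cost(200_000, 8_000):.2f}")  # -> $0.33
```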

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
SWE-Bench Verified
Accuracy (Self-reported)
63.2%

Reasoning

Logical reasoning and analysis
GPQA
Pass@1 (Self-reported). Measures how often the model correctly solves a task on its first attempt: the number of tasks answered correctly on the first try divided by the total number of tasks. A high Pass@1 score indicates that the model is reliable in use, which is especially important in scenarios where there is no way to verify the result or make several attempts. A short computation sketch follows this entry.
83.0%
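
As a rough illustration of the single-attempt Pass@1 computation described above, here is a minimal sketch; the task set and the model-answer callback are hypothetical placeholders, not part of any real benchmark harness.

```python
# Minimal sketch of single-attempt Pass@1: each task is tried once and the answer
# is compared against the reference. `tasks` and `model_answer` are hypothetical.

def pass_at_1(tasks, model_answer) -> float:
    """Fraction of tasks answered correctly on the first (and only) attempt."""
    correct = sum(1 for t in tasks if model_answer(t["question"]) == t["reference"])
    return correct / len(tasks)

# Toy example: a trivial "model" that evaluates arithmetic expressions
toy_tasks = [
    {"question": "2+2", "reference": "4"},
    {"question": "3*3", "reference": "9"},
    {"question": "10/4", "reference": "2.5"},
]
print(pass_at_1(toy_tasks, lambda q: str(eval(q))))  # -> 1.0
```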

Multimodal

Working with images and visual data
MMMU
Pass@1 (Self-reported). Measures the probability that the model solves a task on its first attempt: each task is run once and the solution is checked for correctness. Pass@1 is commonly used to evaluate mathematical and coding ability; in mathematical tasks an answer counts as correct if it matches the reference answer, and in programming tasks a solution counts as correct if it passes all test cases.
79.6%

Other Tests

Specialized benchmarks
Aider-Polyglot
Self-reported
76.5%
Aider-Polyglot Edit
Diff (Self-reported)
72.7%
AIME 2024
Pass@1 (Self-reported). Evaluates the model's ability to solve tasks on the first attempt. To compute Pass@1: 1. The model makes N attempts at each task. 2. For each task, the probability of success on the first attempt is estimated as the number of correct solutions divided by N. 3. Averaging this probability over all tasks gives the Pass@1 score. For example, if the model solves a task in 60 out of 100 attempts, its Pass@1 for that task is 0.6, or 60%. A short computation sketch follows this entry.
92.0%
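
The N-attempt averaging described above can be sketched as follows; the results structure is a hypothetical example, not actual benchmark output.

```python
# Minimal sketch of the averaged Pass@1 estimate: per-task success rate
# (correct attempts / N) averaged over all tasks. `results` is hypothetical.

def pass_at_1_from_attempts(results) -> float:
    """results: list of (num_correct, num_attempts) pairs, one per task."""
    per_task = [correct / attempts for correct, attempts in results]
    return sum(per_task) / len(per_task)

# The worked example from the description: 60 correct out of 100 attempts -> 0.6
print(pass_at_1_from_attempts([(60, 100)]))             # -> 0.6
print(pass_at_1_from_attempts([(60, 100), (90, 100)]))  # -> 0.75
```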
AIME 2025
Pass@1 (Self-reported). Evaluates how often the model produces the correct answer on its first attempt, a standard metric for many tasks such as writing code or solving mathematical problems. Each task is run once and the answer is checked for correctness, so the score measures the model's ability to produce a result without repeated attempts. A high Pass@1 indicates that the model can generate correct answers on the first try, which makes it more dependable for users who cannot verify or compare multiple answers.
83.0%
ARC-AGI v2
Accuracy (Verified)
4.9%
Global-MMLU-Lite
Accuracy (Self-reported)
88.6%
Humanity's Last Exam
Accuracy (Self-reported). AI-generated solutions may contain various forms of inaccuracies, including: 1) factual errors: incorrect statements presented as facts; 2) mathematical errors: incorrect calculations or mathematical steps; 3) logical errors: flawed reasoning in problem-solving; 4) hallucinations: generation of non-existent information; 5) definition errors: misunderstanding or misuse of technical terms. Accuracy can be assessed on a 5-point scale: 1 = completely incorrect solution with fundamental misunderstandings; 2 = mostly incorrect with some valid elements; 3 = partially correct with significant errors; 4 = mostly correct with minor errors; 5 = completely correct solution with no errors.
17.8%
LiveCodeBench v5
Pass@1 (Self-reported). Determines how many tasks the model solves on its first attempt, i.e., the probability of generating a correct answer right away. This is especially important in contexts where the user relies on the model to produce a correct answer on the first try, without verification or corrections. Method: 1. The model is given a set of tasks. 2. It makes one attempt at each task. 3. The score is the percentage of tasks solved correctly on that first attempt. Limitations: the metric does not account for how close the model came to a correct answer, and partially correct answers are not reflected in the binary Pass@1 score.
75.6%
MRCR
128k average (Self-reported). Measures the model's ability to retrieve and use information from a very long context: questions are asked about content placed within a 128,000-token context, and accuracy is averaged across questions. This capability matters for many real-world long-context tasks.
93.0%
MRCR 1M (pointwise)
Self-reported
82.9%
SimpleQA
Accuracy (Self-reported)
50.8%
Vibe-Eval
Accuracy (Self-reported)
65.6%
Video-MME
Accuracy (Self-reported)
84.8%

License & Metadata

License
proprietary
Announcement Date
May 20, 2025
Last Updated
July 19, 2025
