Key Specifications
Parameters
-
Context
1.0M
Release Date
May 20, 2025
Average Score
69.6%
Timeline
Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$1.25
Output (per 1M tokens)
$10.00
Max Input Tokens
1.0M
Max Output Tokens
65.5K
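The listed per-1M-token prices translate into a simple per-request cost formula. A minimal sketch, with hypothetical request sizes (only the two prices come from the listing above):

```python
# Cost estimate from the listed per-1M-token rates (request sizes are made up).
INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 200k-token prompt with a 4k-token answer (hypothetical sizes).
print(round(request_cost(200_000, 4_000), 4))  # 0.29
```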
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
SWE-Bench Verified
Accuracy
Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Pass@1 evaluation, which measures how often the model solves a task correctly on its first attempt. The score reflects the proportion of questions on which the model immediately gives the correct answer, with no need for several attempts. To compute Pass@1, count how many tasks in the set the model solved correctly on the first try and divide by the total number of tasks. A high Pass@1 score indicates the model is reliable in use, which is especially important in scenarios where there is no way to verify the result or make several attempts • Self-reported
Multimodal
Working with images and visual data
MMMU
Pass@1 measures the probability that the model solves a task on its first attempt. The score indicates the model's ability to solve tasks without retries or corrections. To obtain the Pass@1 value, each task is run once and the solution is checked for correctness. Pass@1 is commonly used to evaluate mathematical and coding abilities: in mathematical tasks an answer is considered correct if it matches the reference answer, regardless of the worked solution; in programming tasks a solution is considered correct if it passes all test cases • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Method: Trace-of-Thought (ToT) — a method developed for deeper analysis and understanding of how language models work. ToT reconstructs the reasoning steps a model attempts while solving a task by analyzing the intermediate tokens it generates: not only the tokens actually selected during generation (whether by greedy decoding, where the most probable token is chosen, or by sampling, where choice is weighted by probability), but also alternative tokens with high probability that the model did not select. These alternatives can provide a picture of the model's internal "thought process". ToT branches on such tokens and, for each branch, generates a full continuation exactly as the model would, then analyzes these continuations to identify internal reasoning the model could have carried out but did not express in its final output • Self-reported
Aider-Polyglot Edit
Diff — a method for comparing models by their ability to solve tasks, including cases where the models differ substantially (for example, have different architectures, or when the goal is not to rank them against experts). The Diff method identifies each model's strong and weak sides, which supports a deeper understanding of the models. The methodology consists of several steps: obtain solutions from the two models on the same test examples, evaluate these solutions independently, and finally analyze the disagreements between the models (that is, cases where one model solves an example correctly and the other does not). The analysis surfaces each model's advantages and strengths. Outcomes can differ in kind: comparable capabilities, one model outperforming the other across the board, or each model having its own strong side. Diff can be applied to models that solve tasks using different approaches or that differ in overall performance, making it possible to compare even models that are hard to compare directly and to find specific usage scenarios where one model wins despite lower aggregate performance • Self-reported
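The pairwise comparison described above can be sketched in a few lines: run both models on the same test set, then partition the examples by which model got them right. This is an illustrative sketch with made-up results, not the site's actual tooling:

```python
# Hypothetical per-example correctness for two models on a shared test set.
results_a = {"q1": True, "q2": False, "q3": True, "q4": True}   # model A correct?
results_b = {"q1": True, "q2": True,  "q3": False, "q4": True}  # model B correct?

# Disagreements reveal each model's distinct strengths.
only_a = [q for q in results_a if results_a[q] and not results_b[q]]
only_b = [q for q in results_a if results_b[q] and not results_a[q]]
both   = [q for q in results_a if results_a[q] and results_b[q]]

print(only_a)  # tasks only model A solved: ['q3']
print(only_b)  # tasks only model B solved: ['q2']
```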
AIME 2024
Pass@1 — the Pass@1 metric evaluates the model's ability to solve tasks on the first attempt. To compute Pass@1: 1. The model makes N attempts at each task. 2. For each task, the probability of first-attempt success is estimated as the number of correct solutions divided by N. 3. Averaging this probability over all tasks gives the Pass@1 score. For example, if a model successfully solves a task in 60 of 100 attempts, its Pass@1 for that task is 0.6, or 60%. Pass@1 measures the model's capability rather than luck across retries, which matters given the stochastic text generation of modern models • Self-reported
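The N-attempt estimate described above can be sketched as follows. The data is hypothetical; only the 60/100 → 0.6 example comes from the text:

```python
# Pass@1 sketch: per-task fraction of correct attempts, averaged over tasks.
# attempts[task] is a list of booleans, one per sampled attempt (made-up data).
attempts = {
    "task1": [True] * 60 + [False] * 40,   # 60/100 correct -> 0.6
    "task2": [True, False, True, True],    # 3/4 correct -> 0.75
}

def pass_at_1(attempts_by_task):
    """Average, over tasks, the estimated first-attempt success probability."""
    per_task = [sum(a) / len(a) for a in attempts_by_task.values()]
    return sum(per_task) / len(per_task)

print(round(pass_at_1(attempts), 3))  # (0.6 + 0.75) / 2 = 0.675
```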
AIME 2025
Pass@1 — the Pass@1 metric evaluates how often the model gets the correct answer on its first attempt. It is the standard metric for many tasks, especially ones like writing code or solving mathematical problems. For Pass@1 evaluation, the model is run exactly once on each task, and then it is determined whether the answer was correct. This approach measures the model's ability to produce a result without repeated attempts or corrections. A high Pass@1 score indicates that the model can generate correct answers on the first try, which makes it more dependable for users who have no way to verify or choose among several candidate answers • Self-reported
ARC-AGI v2
Accuracy • Verified
Global-MMLU-Lite
Accuracy
Self-reported
Humanity's Last Exam
Accuracy
AI-generated solutions may have various forms of inaccuracies, including:
1) Factual errors: incorrect statements presented as facts
2) Mathematical errors: incorrect calculations or mathematical steps
3) Logical errors: flawed reasoning in problem-solving
4) Hallucinations: generation of non-existent information
5) Definition errors: misunderstanding or misuse of technical terms
Accuracy can be assessed on a 5-point scale:
1 - Completely incorrect solution with fundamental misunderstandings
2 - Mostly incorrect with some valid elements
3 - Partially correct with significant errors
4 - Mostly correct with minor errors
5 - Completely correct solution with no errors • Self-reported
LiveCodeBench v5
Pass@1 — in Pass@1 evaluation we determine how many tasks the model solves on its first attempt, i.e. the probability of generating a correct answer straight away. This is especially important in contexts where the user relies on the model for a correct answer the first time, without verification or corrections. Method: 1. The model is given a set of tasks. 2. It makes one attempt per task. 3. The score is the percentage of tasks solved correctly on the first attempt. Strengths: an intuitive metric for real-world use, reflecting the model's ability to give exact answers without offering several options. Limitations: it does not account for how close the model came to a correct answer in failure cases, it does not reflect the model's ability to improve from feedback, and in some tasks partially correct answers are lost in the binary Pass@1 score • Self-reported
MRCR
128k-average. Methodology: the model is asked questions about content placed near the end of a 128,000-token context. The model has access to the full context, and the results indicate the accuracy of its answers across the various questions. This evaluates the model's ability to retrieve and use information from very long inputs, a key capability for many real-world tasks • Self-reported
MRCR 1M (pointwise)
Self-reported
SimpleQA
Accuracy
AI-graded: 26 / 30 correct; human-verified: 26 / 30 correct.
Answers are assessed by whether they match the reference, not by whether they're correct. There were a few legitimate math errors in this set:
Question 1: AI slipped up on a sign during integration, yielding -3/4 instead of +3/4.
Question 11: AI made an algebraic error manipulating complex numbers.
Question 15: AI mistakenly concluded that 3-adic numbers with positive valuations must equal 0.
Question 17: AI incorrectly computed an integral by making a sign error. • Self-reported
Vibe-Eval
Accuracy. Definition of the scores used for task performance: to analyze how well the GAIA-1 model solves tasks, several scores are used, each measuring a different aspect of answer quality. Accuracy: evaluates how well the GAIA-1 answer agrees with the correct answer — 1 point if the answer fully matches the reference (including mathematical expressions and numerical values), 0 points if it differs. Both automated checks and manual evaluation are used to identify edge cases. Justification: evaluates how completely and correctly the model explains its answer, since even a correct answer should be properly reasoned — 1 point if the justification is complete and contains no errors, 0 points if it contains significant gaps or errors. Overall correctness: combines accuracy and justification — 1 point if the answer is exact and the justification complete (both previous scores are 1), 0 points otherwise. These scores evaluate GAIA-1's ability not only to find correct answers but also to follow sound mathematical reasoning while solving tasks • Self-reported
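The binary rubric described above (accuracy, justification, and an overall score requiring both) can be sketched as follows; the function and field names are illustrative, not the evaluator's actual code:

```python
# Sketch of a binary rubric: accuracy and justification each score 0 or 1,
# and the overall score is 1 only when both are 1.
def score_answer(matches_reference: bool, justification_sound: bool) -> dict:
    accuracy = 1 if matches_reference else 0
    justification = 1 if justification_sound else 0
    overall = 1 if accuracy == 1 and justification == 1 else 0
    return {"accuracy": accuracy, "justification": justification, "overall": overall}

# A correct answer with flawed reasoning still gets overall = 0.
print(score_answer(True, False))  # {'accuracy': 1, 'justification': 0, 'overall': 0}
```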
Video-MME
Accuracy
Self-reported
License & Metadata
License
proprietary
Announcement Date
May 20, 2025
Last Updated
July 19, 2025
Articles about Gemini 2.5 Pro
Similar Models
All Models
Gemini 3 Flash
MM
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 2.5 Pro Preview 06-05
MM
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemini 2.5 Flash
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $0.30/1M tokens
Gemini 3 Pro
MM
Best score: 0.9 (GPQA)
Released: Nov 2025
Price: $2.00/1M tokens
Gemini 1.5 Pro
MM
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Gemini 2.5 Flash-Lite
MM
Best score: 0.6 (GPQA)
Released: Jun 2025
Price: $0.10/1M tokens
Gemini 3.1 Pro
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 1.5 Flash
MM
Best score: 0.8 (MMLU)
Released: May 2024
Price: $0.15/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
