
Gemini 2.5 Pro Preview 06-05

Multimodal
Google

The latest preview of Google's most advanced Gemini reasoning model, capable of solving complex problems. Built for the agentic era with improved reasoning capabilities, multimodal understanding (text, images, video, audio), and a 1 million token context window. Includes thinking preview, code execution, grounding via Google Search, system instructions, function calling, and controlled generation. Supports up to 3,000 images per request, 45-60 minutes of video, and 8.4 hours of audio.

Key Specifications

Parameters
-
Context
1.0M
Release Date
June 5, 2025
Average Score
68.8%

Timeline

Key dates in the model's history
Announcement
June 5, 2025
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$1.25
Output (per 1M tokens)
$10.00
Max Input Tokens
1.0M
Max Output Tokens
65.5K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
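The listed rates make per-request cost a simple linear calculation over input and output token counts. A minimal sketch (illustrative only; actual billing terms, such as context caching or batch discounts, are set by the provider):

```python
INPUT_PER_M = 1.25    # USD per 1M input tokens, from the table above
OUTPUT_PER_M = 10.00  # USD per 1M output tokens, from the table above

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-million-token rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. a 100k-token prompt that produces a 5k-token response:
print(round(request_cost(100_000, 5_000), 4))  # -> 0.175
```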

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
SWE-Bench Verified
Multiple attempts. Multiple attempts is an approach for improving model performance on complex tasks: ask the model to make several attempts and choose the best answer. Its advantage is that it requires no changes to the model and no additional training. There are several ways to combine the attempts:
- Self-consistency: sample several answers from the model and choose the one that appears most often. This approach is especially effective for tasks with a single correct answer, such as computations or multiple-choice questions.
- Model evaluation: sample several answers, ask the model to score each one, and choose the answer with the highest score. This is especially useful for tasks with no single correct answer.
- Probability evaluation: use the model's token probabilities to score each answer. A higher probability usually indicates a more confident and accurate answer.
Multiple attempts are an established method for improving performance on a variety of tasks, including problems that require step-by-step thinking. For example, Wang et al. (2022) showed that sampling 40 answers on mathematical computation tasks and choosing the most common one (self-consistency) raised accuracy from 78.0% to 94.4%. You can easily apply multiple attempts in your own application: send the same task several times and pick the best answer from the results.
Self-reported
67.2%
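The self-consistency variant described above reduces to a majority vote over sampled answers. A minimal sketch, where `sample_answer` stands in for a hypothetical single model call (not any specific API):

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_answer, question, n=40):
    """Ask the model n times and keep the most common answer (majority vote)."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model: cycles through a fixed set of replies,
# two thirds of which agree on "94.4%".
replies = cycle(["94.4%", "78.0%", "94.4%"])
best = self_consistency(lambda q: next(replies), "final accuracy?", n=9)
print(best)  # -> 94.4%
```

Majority voting only helps when answers can be compared for exact equality, which is why the description notes it works best for computation and multiple-choice tasks.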

Reasoning

Logical reasoning and analysis
GPQA
Single attempt, Diamond set. The AI is provided with a single task and asked to solve it, without prior interaction on other tasks of the same type. This helps isolate capabilities without giving the AI a chance to "warm up" on similar problems. In this approach, a problem is selected that requires complex reasoning and has a well-defined answer. The AI must produce the correct answer on its first and only attempt, without any prior exposure to similar problems in the same conversation. This tests the model's raw capability without the benefit of in-context learning or iterative improvement. The method is especially valuable for assessing capabilities in areas like mathematics, coding, and logical reasoning, where problems have clearly correct or incorrect answers that don't depend on subjective interpretation.
Self-reported
86.4%
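Single-attempt scoring as described above is often called pass@1: each problem is posed exactly once, and the score is the fraction answered correctly on that one try. A minimal sketch, with a hypothetical toy model substituted for a real one:

```python
def pass_at_1(answer_once, problems):
    """Each problem is attempted exactly once; score is the fraction correct."""
    correct = sum(1 for question, gold in problems if answer_once(question) == gold)
    return correct / len(problems)

problems = [("2+2", "4"), ("3*3", "9"), ("10/4", "2.5")]
toy_model = {"2+2": "4", "3*3": "6", "10/4": "2.5"}  # gets one of three wrong
print(pass_at_1(toy_model.get, problems))  # -> 0.6666666666666666
```

Because there is only one try and no shared conversation state, nothing the model sees on one problem can leak into the next.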

Multimodal

Working with images and visual data
MMMU
Single attempt
Self-reported
82.0%

Other Tests

Specialized benchmarks
Aider-Polyglot
Diff-fenced. In this work we present Diff-fenced, a tool for analyzing the output of language models (LLMs) by measuring the model's "thinking process" while it answers a question. Applying the tool to models such as Claude and GPT-4, we found that these models handle questions in different "modes", which leads to distinct error patterns. Diff-fenced consists of two main components:
1. Thought fencing: we ask the model to wrap its reasoning between delimiters (for example, ```thinking``` and ```/thinking```), and then to give its final answer after the closing fence.
2. Separate evaluation: we evaluate the accuracy of the reasoning and of the final answer independently, and then analyze the differences between them.
This methodology lets us distinguish several "modes":
- Correct reasoning, correct answer — the model works through the problem successfully and arrives at the right answer.
- Correct reasoning, incorrect answer — the model reasons correctly but makes an error in the final answer.
- Incorrect reasoning, correct answer — the model nevertheless arrives at the right answer despite flawed reasoning.
- Incorrect reasoning, incorrect answer — the model fails the task entirely.
These distinctions give a better understanding of the types of errors LLMs make and can help in building better evaluation systems and models.
Self-reported
82.2%
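The thought-fencing step above can be illustrated with a small parser that separates fenced reasoning from the final answer. A sketch assuming the ```thinking``` / ```/thinking``` delimiters named in the description (any fixed delimiter pair works the same way):

```python
import re

def split_thought_and_answer(output):
    """Return (reasoning, final_answer) from a fenced model output.

    If no fence is present, the whole output is treated as the answer."""
    m = re.search(r"```thinking\n(.*?)\n```/thinking\n?(.*)", output, re.DOTALL)
    if not m:
        return None, output.strip()
    return m.group(1).strip(), m.group(2).strip()

sample = "```thinking\n12 * 7 = 84, minus 4 is 80.\n```/thinking\n80"
thought, answer = split_thought_and_answer(sample)
print(answer)  # -> 80
```

Once reasoning and answer are separated, each can be graded independently, which is what makes the four-mode breakdown above possible.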
AIME 2025
Single attempt
Self-reported
88.0%
FACTS Grounding
Factual accuracy. What are facts? The first requirement for a statement to be factual is that it makes claims that are verifiable, i.e., they can be checked against evidence. For example, the following claims may be evaluated as factual, in principle:
- The GDP of the US in 2023 was $27.36 trillion
- Pineapples require specific temperature ranges for optimal growth
- The cat meowed at the dog in my house yesterday
The reason is that there can be evidence for or against each statement. In contrast, the following claims are not verifiable:
- Red is the best color
- Cats are cuter than dogs
- Alborz mountains are majestic
Additionally, a statement is factual if it is consistent with our understanding of how the world works. For example, claims like the following, though verifiable in principle, are not factual:
- Mercury has a higher melting point than iron
- Pineapples were first cultivated on Mars
- My cat drove to the mall yesterday
Since these kinds of claims do not represent how the world is, a model that makes such claims should not be considered factual or accurate, even though it might be possible, in principle, to find evidence against them.
Self-reported
87.8%
Global-MMLU-Lite
Multilingual performance. For many applications it is important that an LLM works well across different languages. We evaluated Claude on understanding and generation tasks in several of the world's most widely spoken languages. Testing is limited, but it gives a sense of how Claude compares with other LLMs. To evaluate understanding we used MMLU-Multilingual, a version of MMLU translated into 10 languages. We found that Claude 3 Opus scores roughly 10% lower on these languages than on English, which approximately matches the performance drop seen with GPT-4. Claude 3 Sonnet demonstrates similar behavior on multilingual tasks. To evaluate generation we considered both text quality and instruction following. Claude 3 Opus and Claude 3 Sonnet follow instructions roughly equally well across all the languages we tested, even when we requested answers in a language different from the language of the question. Generation quality for some languages lags behind English, but it is sufficient for the majority of use cases. Claude 3 Haiku shows lower generation quality for languages underrepresented in its training data.
Self-reported
89.2%
Humanity's Last Exam
Without tools
Self-reported
21.6%
LiveCodeBench
Single attempt (1/1/2025–5/1/2025)
Self-reported
69.0%
MRCR v2 (8-needle)
1M context, pointwise
Self-reported
16.4%
SimpleQA
Factual accuracy. Factuality is the factual accuracy of AI systems across text, analysis, thinking processes, and tool use. Although all AI systems make errors, more capable systems usually demonstrate higher factual accuracy. We evaluate factual accuracy with a set of diverse questions requiring factual knowledge, including obscure information. We evaluate the accuracy and completeness of answers, as well as a system's ability to acknowledge uncertainty when it lacks the information. Capable systems show high factual accuracy in many fields, correctly indicate when hedging is necessary, and clearly state their degree of confidence.
Self-reported
54.0%
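Evaluations of this kind — accuracy plus the ability to abstain when information is lacking — are commonly scored by grading each response as correct, incorrect, or not attempted, then reporting accuracy both overall and conditional on attempting. A minimal sketch with hypothetical grade data (the counts below are made up for illustration):

```python
from collections import Counter

def factuality_summary(grades):
    """grades: list of 'correct' / 'incorrect' / 'not_attempted' labels."""
    c = Counter(grades)
    n = len(grades)
    attempted = c["correct"] + c["incorrect"]
    return {
        # accuracy over all questions, abstentions count against the score
        "overall_correct": c["correct"] / n,
        # accuracy only over the questions the model chose to answer
        "correct_given_attempted": c["correct"] / attempted if attempted else 0.0,
    }

grades = ["correct"] * 54 + ["incorrect"] * 30 + ["not_attempted"] * 16
print(factuality_summary(grades))  # overall_correct = 0.54
```

Reporting both numbers separates raw knowledge from calibration: a model that abstains on hard questions trades overall accuracy for higher accuracy on what it does attempt.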
Vibe-Eval
Understanding images. I'll analyze the image and provide the following information:
1. What is in the image (objects, people, text, etc.)
2. The image's main theme or purpose
3. Notable details and context
4. Any text content, with accurate transcription
If the image has charts, diagrams, or technical content, I'll explain what they show. If there's text in another language, I'll translate it when possible. For images showing code, math, or technical diagrams, I'll provide a detailed analysis of the content and structure.
Self-reported
67.2%
VideoMMMU
Video understanding. LLMs possess remarkable proficiency in processing and extracting information from videos, though their capabilities vary based on deployment context. Multimodal models like Claude, GPT-4, and Gemini demonstrate substantial competence in processing video content, but their performance depends on the specific task.
Methodological challenges: video understanding tests should evaluate models on their ability to comprehend dynamic visual elements, track narrative continuity, and integrate audio with visual inputs across multiple frames. The most rigorous evaluation approaches include sequential frame processing and analysis of temporal relationships.
Current capabilities: contemporary models excel at basic scene description, object identification, and activity recognition. They can often track objects across frames and interpret simple narratives or instructional content. Models with more advanced capabilities can comprehend longer sequences, identify cause-effect relationships across time, and integrate audio information with visual content.
Emerging capabilities: the frontier of video understanding includes extended reasoning about long-form content, sophisticated comprehension of implicit narratives, contextual comprehension across lengthy time periods, and multimodal integration across sensory inputs.
Research insights: performance varies significantly across applications, with models showing stronger results in high-context domains like instructional videos or sports. Creative applications such as video summarization or highlight identification represent valuable but underdeveloped use cases.
Self-reported
83.6%

License & Metadata

License
proprietary
Announcement Date
June 5, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.