Key Specifications
Parameters
-
Context
1.0M
Release Date
May 20, 2025
Average Score
62.5%
Timeline
Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.30
Output (per 1M tokens)
$2.50
Max Input Tokens
1.0M
Max Output Tokens
65.5K
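For concreteness, here is a minimal sketch of how the rates above translate into per-request cost. The token counts in the example are hypothetical, and the helper function is an illustration, not part of any API; only the prices are taken from the table above.

```python
# Minimal sketch: estimating request cost from the per-token prices above.
# The token counts below are hypothetical examples, not measured values.

INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens (from the pricing table)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 100K-token prompt with a 2K-token response.
print(f"${estimate_cost(100_000, 2_000):.4f}")  # -> $0.0350
```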
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
SWE-Bench Verified
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Pass@1 In the context of language models (LLMs), Pass@1 is a metric for evaluating a model's performance at solving tasks. Pass@1 measures the proportion of tasks the model solves correctly on its first attempt, when it generates only one solution: the model is given a task, generates a single solution, and that solution is evaluated (for programming tasks, against all test cases). Pass@1 = (number of tasks solved on the first attempt) / (total number of tasks). The metric matters because it evaluates the model's ability to produce a working solution without needing several attempts or iterations; a high Pass@1 score indicates the model understands the task and can produce a correct solution on the first try. Unlike Pass@k (where k > 1), which allows the model to generate several solutions and counts success if at least one of them works, Pass@1 is the stricter metric: correctness on the first attempt. • Self-reported
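As a toy sketch of the first-attempt scoring described above: one candidate solution per task, scored 1 if it passes every check and 0 otherwise. The task format (callable solutions with input/expected test pairs) is an assumption for illustration, not any specific harness.

```python
# Illustrative sketch of Pass@1: one candidate solution per task, scored 1 if it
# passes every test case, 0 otherwise. The task format here is a toy assumption.

from typing import Callable

def passes_all_tests(solution: Callable, tests: list[tuple]) -> bool:
    """tests: (args, expected) pairs run against the candidate solution."""
    return all(solution(*args) == expected for args, expected in tests)

def pass_at_1(tasks: list[dict]) -> float:
    """Pass@1 = (tasks solved on the first attempt) / (total tasks)."""
    solved = sum(passes_all_tests(t["solution"], t["tests"]) for t in tasks)
    return solved / len(tasks)

# Toy example: two tasks, the first solved, the second not.
tasks = [
    {"solution": lambda a, b: a + b, "tests": [((1, 2), 3), ((0, 0), 0)]},
    {"solution": lambda a, b: a - b, "tests": [((1, 2), 3)]},
]
print(pass_at_1(tasks))  # -> 0.5
```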
Multimodal
Working with images and visual data
MMMU
Pass@1 In the context of evaluating AI models on complex tasks, especially those involving perception and reasoning, Pass@1 is a metric that estimates the probability that the model solves a problem correctly on its first attempt. Under Pass@1 the model provides one solution for each task; these solutions are then checked automatically to determine whether they solve the task correctly. The fraction of tasks solved correctly on the first attempt gives the Pass@1 value. The metric is particularly useful for evaluating a model's ability to solve problems without techniques such as repeated sampling: it measures the model's base ability to generate correct solutions without needing several attempts. A high Pass@1 score indicates the model understands the domain and can consistently generate correct solutions on the first try, which is an important indicator of its overall effectiveness. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard evaluation • Self-reported
Aider-Polyglot Edit
Diff-Fenced edit format • Self-reported
AIME 2024
Pass@1 The Pass@1 metric measures the percentage of tasks the model solves on its first attempt. It is a baseline measure of performance with no opportunity to correct errors. Under Pass@1 evaluation the model gets one attempt at each task and either solves it correctly (1 point) or does not (0 points); the points are then averaged over all tasks to obtain the overall Pass@1. Note that Pass@1 gives no credit for being close to the correct answer: it is a strict, binary metric (correct/incorrect). It also does not account for the reasoning or intermediate steps taken to reach the answer. And although Pass@1 is a useful baseline, it does not reflect how well a model can improve its answers over several attempts, as metrics such as Pass@k do. • Self-reported
AIME 2025
Pass@1 The Pass@1 metric evaluates the proportion of problems the model solves on its first attempt. It is the strictest such metric, since it gives the model no opportunity to correct errors or improve its answer over several attempts. To compute Pass@1 we use a sampling-based estimate of the probability that a single sample drawn from the generated set is correct. This method accounts for both the correct and incorrect answers the model gives on a task: if the model generates n samples for a task and k of them are correct, Pass@1 is calculated as Pass@1 = k / n. The value of Pass@1 lies in its simplicity and directness: it directly measures the model's ability to give a correct answer on the first attempt, which matters for many practical applications. • Self-reported
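A minimal sketch of the k/n estimate above, alongside the standard unbiased generalization to Pass@k known from the code-generation evaluation literature; the sample counts in the example are invented.

```python
# Sketch of the Pass@1 estimate described above: k correct out of n samples.
# The generalization to Pass@k is the standard unbiased estimator
# 1 - C(n-c, k) / C(n, k); for k = 1 it reduces to c / n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated, c: samples that are correct, k: attempts allowed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 8 samples per problem, 2 of them correct.
print(pass_at_k(n=8, c=2, k=1))  # -> 0.25 (equals c/n)
```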
FACTS Grounding
Accuracy • Self-reported
Global-MMLU-Lite
Accuracy In mathematics and problem solving, accuracy is of central importance. It measures a model's ability to give correct answers to the questions put to it. Accuracy can be evaluated in various ways: simple verification of the final answer, evaluation of the intermediate steps in solving a task, or checking whether the model's reasoning can lead to the correct result. A model should not only provide correct answers but do so with a minimal number of errors in its computations and reasoning. On complex mathematical tasks a model must maintain accuracy across many reasoning steps, apply concepts correctly, and avoid computational errors. To this end, machine-learning models are evaluated on mathematical benchmarks of varying complexity, from basic problems up to tasks at the level of AIME and IMO. These accuracy evaluations help us understand how effectively a model can cope with such tasks and where improvements are needed. • Self-reported
Humanity's Last Exam
Accuracy • Self-reported
LiveCodeBench v5
Pass@1 Pass@1 is a metric that measures what percentage of tasks a model can solve on its first attempt. Unlike a solve rate, which estimates the probability of solving a task after several attempts, Pass@1 measures the model's ability to generate a correct answer the first time. The metric is especially important in scenarios where users expect fast and exact results: it reflects the model's ability to solve tasks without repeated attempts, making it a measure of the model's reliability and accuracy. In the context of mathematical and programming tasks, Pass@1 is a strict metric, since it requires the answer to be fully correct on the first attempt. This makes Pass@1 particularly useful for evaluating models in fields where errors are costly. A lower Pass@1 score means the model more often requires additional iterations to reach a correct answer, which costs time and compute. • Self-reported
MRCR
1M-pointwise A method of comparing models in which the same query, drawn from a set of diverse assignments and queries, is shown to two different models (for example, Claude 3 Opus and GPT-4), and human experts are asked to judge which answer is better. This gives a direct picture of each model's strengths and weaknesses. Advantages of this method: it measures practical capability and gives a representative picture of performance; it tests the models' ability to perform very diverse tasks that can be hard to capture with fixed benchmarks; and it includes real user queries, which makes it more relevant to actual use. Disadvantages: results can be hard to interpret because of noisy data; evaluating many different queries requires substantial resources; and differences between models can be minor, making conclusions about their relative quality uncertain. • Self-reported
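As an illustrative sketch of the pairwise protocol just described, the snippet below tallies expert judgments into per-model win rates. The judgment format and model names are assumptions made for the example, not any real evaluation data.

```python
# Sketch of tallying pairwise expert judgments into win rates.
# Each judgment records which of two models' answers the expert preferred;
# the data below is invented for illustration.
from collections import Counter

judgments = [
    {"pair": ("model_a", "model_b"), "winner": "model_a"},
    {"pair": ("model_a", "model_b"), "winner": "model_b"},
    {"pair": ("model_a", "model_b"), "winner": "model_a"},
]

wins = Counter(j["winner"] for j in judgments)
comparisons = Counter(m for j in judgments for m in j["pair"])

for model in comparisons:
    print(f"{model}: {wins[model] / comparisons[model]:.0%} win rate")
# -> model_a: 67% win rate, model_b: 33% win rate
```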
SimpleQA
Accuracy • Self-reported
Vibe-Eval
Accuracy The percentage or proportion of correct answers a model gives on a set of test tasks. It is the most common measure of effectiveness for many tasks, especially those with specific correct answers.
Accuracy = (number of correct answers) / (total number of questions)
Advantages:
- Easy to understand and interpret
- Well suited to tasks with clearly correct answers
- Allows different models or approaches to be compared directly
Limitations:
- Does not account for the varying difficulty of questions
- Does not reflect the model's confidence in its answers
- Can be misleading for imbalanced data
- Gives no picture of the kinds of errors made
• Self-reported
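A minimal sketch of the accuracy formula above; the predictions and reference answers are invented toy data.

```python
# Sketch of the accuracy formula above: correct answers / total questions.
# The toy predictions and reference answers below are invented.

def accuracy(predictions: list, references: list) -> float:
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # -> 0.75
```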
License & Metadata
License
proprietary
Announcement Date
May 20, 2025
Last Updated
July 19, 2025
Similar Models
Gemini 3 Flash
MM
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 1.5 Pro
MM
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Gemini 2.5 Pro Preview 06-05
MM
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemini 2.5 Flash-Lite
MM
Best score: 0.6 (GPQA)
Released: Jun 2025
Price: $0.10/1M tokens
Gemini 3 Pro
MM
Best score: 0.9 (GPQA)
Released: Nov 2025
Price: $2.00/1M tokens
Gemini 3.1 Pro
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.