Gemini 2.0 Flash
Multimodal
A next-generation model with superior speed, built-in tool use, multimodal generation, and a 1 million token context window. Supports audio, image, video, and text input, with structured output, function calling, code execution, search, and multimodal output capabilities.
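A minimal sketch of a text-generation call against this model, assuming the google-genai Python SDK and an API key in the GEMINI_API_KEY environment variable; the exact calls are illustrative, not an official quickstart.

# Minimal text-generation call (sketch, assuming the google-genai SDK).
# pip install google-genai; set GEMINI_API_KEY in the environment.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY by default

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize the trade-offs between long-context and retrieval-based approaches.",
)
print(response.text)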
Key Specifications
Parameters
-
Context
1.0M
Release Date
December 1, 2024
Average Score
66.7%
Timeline
Key dates in the model's history
Announcement
December 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
August 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
8.2K
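As a rough illustration of the listed rates (the token counts below are assumptions, not measurements), per-request cost scales linearly with input and output tokens:

# Back-of-the-envelope cost estimate at the listed Gemini 2.0 Flash rates.
# The request sizes are hypothetical examples.
INPUT_USD_PER_1M = 0.10
OUTPUT_USD_PER_1M = 0.40

input_tokens = 200_000   # e.g. a long document plus instructions
output_tokens = 4_000    # e.g. a detailed summary

cost = (input_tokens / 1e6) * INPUT_USD_PER_1M + (output_tokens / 1e6) * OUTPUT_USD_PER_1M
print(f"Estimated request cost: ${cost:.4f}")  # 0.02 + 0.0016 = $0.0216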
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
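For example, the structured-output feature can constrain responses to a JSON schema. The sketch below again assumes the google-genai Python SDK; the Recipe schema is a made-up illustration.

# Structured-output sketch (assumes the google-genai SDK and pydantic).
# The Recipe schema is hypothetical; substitute your own fields.
from google import genai
from pydantic import BaseModel

class Recipe(BaseModel):
    name: str
    ingredients: list[str]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Give me a simple pancake recipe.",
    config={
        "response_mime_type": "application/json",
        "response_schema": Recipe,
    },
)
print(response.text)  # JSON text conforming to the Recipe schema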
Benchmark Results
Model performance metrics across various tests and benchmarks
Mathematics
Mathematical problems and computations
MATH
Competition-level mathematics problems covering algebra, geometry, probability, and other topics • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
A set of challenging questions written by domain experts, designed to be difficult even with web access • Self-reported
Multimodal
Working with images and visual data
MMMU
College-level multimodal tasks testing image understanding and reasoning • Self-reported
Other Tests
Specialized benchmarks
Bird-SQL (dev)
Natural-language-to-SQL evaluation: given a question about data and the database schema, the model must produce a correct SQL query; reported on the dev set • Self-reported
CoVoST2
Automatic speech translation across 21 languages, scored with BLEU • Self-reported
EgoSchema
Long-form video question answering over egocentric video clips • Self-reported
FACTS Grounding
Ability to provide factually grounded answers based on provided source material across diverse queries • Self-reported
HiddenMath
Competition-level mathematics problems from a held-out AIME/AMC-style dataset • Self-reported
LiveCodeBench
Python code generation evaluated on newly published problems (collected June 1 to October 5, 2024) • Self-reported
MMLU-Pro
A harder, more robust version of the MMLU evaluation set • Self-reported
MRCR
Evaluation of long-context understanding • Self-reported
Natural2Code
Code generation evaluated across several programming languages • Self-reported
Vibe-Eval
Multimodal understanding evaluated on challenging examples • Self-reported
License & Metadata
License
proprietary
Announcement Date
December 1, 2024
Last Updated
July 19, 2025
Similar Models
Gemini 1.5 Flash
MM
Best score: 0.8 (MMLU)
Released: May 2024
Price: $0.15/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Gemini 2.5 Pro Preview 06-05
MM
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemma 3n E4B
MM · 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Gemini 2.5 Flash
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $0.30/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score: 0.5 (GPQA)
Released: Feb 2025
Price: $0.07/1M tokens
Gemma 3n E4B Instructed LiteRT Preview
MM · 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.