
Gemini 2.0 Flash

Multimodal
Google

A next-generation model with superior speed, built-in tool use, multimodal generation, and a 1 million token context window. It accepts audio, image, video, and text input, and supports structured output, function calling, code execution, search, and multimodal output.

Key Specifications

Parameters
-
Context
1.0M
Release Date
December 1, 2024
Average Score
66.7%

Timeline

Key dates in the model's history
Announcement
December 1, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
August 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
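At these rates, the cost of a request can be estimated directly from token counts. A minimal sketch in Python (the helper name and the example token counts are illustrative, not part of any official SDK):

```python
# Published rates for Gemini 2.0 Flash (USD per 1M tokens).
INPUT_RATE = 0.10
OUTPUT_RATE = 0.40

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request from its token counts."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# A request using the full 1.0M-token context and the 8.2K output cap:
print(round(request_cost(1_000_000, 8_200), 4))  # input cost dominates at these limits
```

Note that output tokens cost 4x as much as input tokens, but the far larger input window means long-context requests are usually dominated by input cost.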

Benchmark Results

Model performance metrics across various tests and benchmarks

Mathematics

Mathematical problems and computations
MATH
Challenging competition-style mathematics problems. Self-reported
89.7%

Reasoning

Logical reasoning and analysis
GPQA
Graduate-level science questions written by domain experts. Self-reported
62.1%

Multimodal

Working with images and visual data
MMMU
College-level multimodal understanding and reasoning tasks. Self-reported
70.7%

Other Tests

Specialized benchmarks
Bird-SQL (dev)
Natural-language-to-SQL generation, evaluated on the BIRD development set. Self-reported
56.9%
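Benchmarks of this kind are typically scored by execution: a generated query counts as correct when it runs and returns the same rows as a reference query. A minimal sketch using Python's standard-library sqlite3 module (the schema and queries are illustrative, not drawn from the BIRD dataset):

```python
import sqlite3

# Toy schema standing in for a BIRD-style database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, country TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'DE', 20.0), (2, 'DE', 40.0), (3, 'FR', 10.0);
""")

def execution_match(predicted_sql: str, reference_sql: str) -> bool:
    """Accept a predicted query iff it executes and returns the reference rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # syntax or schema errors count as failures
    return sorted(predicted) == sorted(conn.execute(reference_sql).fetchall())

# Question: "What is the average order amount in Germany?"
reference = "SELECT AVG(amount) FROM orders WHERE country = 'DE'"
print(execution_match("SELECT AVG(amount) FROM orders WHERE country = 'DE'", reference))  # True
print(execution_match("SELECT amount FROM orders", reference))  # False
```

Execution matching sidesteps the problem that many syntactically different queries are equally correct answers to the same question.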
CoVoST2
Automatic speech translation across 21 languages, scored with BLEU. Self-reported
39.2%
EgoSchema
Question answering over long egocentric videos. Self-reported
71.5%
FACTS Grounding
Ability to provide factually grounded answers to diverse queries based on provided documents. Self-reported
83.6%
HiddenMath
Competition-level mathematics problems in the style of AIME/AMC, held out to avoid contamination. Self-reported
63.0%
LiveCodeBench
Python code generation on newly published problems (collection window: June 1, 2024 to October 5, 2024). Self-reported
35.1%
MMLU-Pro
A more challenging, reasoning-focused version of the MMLU evaluation set. Self-reported
76.4%
MRCR
Multi-round coreference resolution: an evaluation of long-context understanding. Self-reported
69.2%
Natural2Code
Code generation across multiple programming languages on a held-out dataset. Self-reported
92.9%
Vibe-Eval
Multimodal understanding evaluated on challenging examples. Self-reported
56.3%
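The headline "Average Score" in the specifications above is consistent with a plain arithmetic mean of the thirteen self-reported results listed in this section. A quick check in Python (assuming an unweighted mean, which this catalog does not state explicitly):

```python
# Self-reported benchmark scores listed above, in order of appearance (%).
scores = [89.7, 62.1, 70.7, 56.9, 39.2, 71.5, 83.6, 63.0,
          35.1, 76.4, 69.2, 92.9, 56.3]

average = sum(scores) / len(scores)
print(round(average, 1))  # matches the 66.7% "Average Score" shown above
```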

License & Metadata

License
proprietary
Announcement Date
December 1, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.