Gemini 2.0 Flash
Multimodal
A next-generation model with superior speed, built-in tool use, multimodal generation, and a 1 million token context window. Supports audio, image, video, and text input, with structured output, function calling, code execution, search, and multimodal output capabilities.
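A minimal sketch of a text-generation call against this model, assuming the google-genai Python SDK and an API key in the GEMINI_API_KEY environment variable; the exact calls are illustrative, not an official quickstart.

# Minimal text-generation call (sketch, assuming the google-genai SDK).
# pip install google-genai; set GEMINI_API_KEY in the environment.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY by default

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize the trade-offs between long-context and retrieval-based approaches.",
)
print(response.text)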
Key Specifications
Parameters
-
Context
1.0M
Release Date
December 1, 2024
Average Score
66.7%
Timeline
Key dates in the model's history
Announcement
December 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
August 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
8.2K
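As a rough illustration of the listed rates (the token counts below are assumptions, not measurements), per-request cost scales linearly with input and output tokens:

# Back-of-the-envelope cost estimate at the listed Gemini 2.0 Flash rates.
# The request sizes are hypothetical examples.
INPUT_USD_PER_1M = 0.10
OUTPUT_USD_PER_1M = 0.40

input_tokens = 200_000   # e.g. a long document plus instructions
output_tokens = 4_000    # e.g. a detailed summary

cost = (input_tokens / 1e6) * INPUT_USD_PER_1M + (output_tokens / 1e6) * OUTPUT_USD_PER_1M
print(f"Estimated request cost: ${cost:.4f}")  # 0.02 + 0.0016 = $0.0216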
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
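For example, the structured-output feature can constrain responses to a JSON schema. The sketch below again assumes the google-genai Python SDK; the Recipe schema is a made-up illustration.

# Structured-output sketch (assumes the google-genai SDK and pydantic).
# The Recipe schema is hypothetical; substitute your own fields.
from google import genai
from pydantic import BaseModel

class Recipe(BaseModel):
    name: str
    ingredients: list[str]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Give me a simple pancake recipe.",
    config={
        "response_mime_type": "application/json",
        "response_schema": Recipe,
    },
)
print(response.text)  # JSON text conforming to the Recipe schema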
Benchmark Results
Model performance metrics across various tests and benchmarks
Mathematics
Mathematical problems and computations
MATH
Competition-level mathematics problems covering algebra, geometry, probability, and other topics • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
A set of challenging questions written by domain experts, designed to be difficult even with web access • Self-reported
Multimodal
Working with images and visual data
MMMU
College-level multimodal tasks testing image understanding and reasoning • Self-reported
Other Tests
Specialized benchmarks
Bird-SQL (dev)
Natural-language-to-SQL evaluation: given a question about data and the database schema, the model must produce a correct SQL query; reported on the dev set • Self-reported
CoVoST2
Automatic speech translation across 21 languages, scored with BLEU • Self-reported
EgoSchema
Long-form video question answering over egocentric video clips • Self-reported
FACTS Grounding
Ability to provide factually grounded answers based on provided source material across diverse queries • Self-reported
HiddenMath
Competition-level mathematics problems from a held-out AIME/AMC-style dataset • Self-reported
LiveCodeBench
Python code generation evaluated on newly published problems (collected June 1 to October 5, 2024) • Self-reported
MMLU-Pro
A harder, more robust version of the MMLU evaluation set • Self-reported
MRCR
Evaluation of long-context understanding • Self-reported
Natural2Code
Code generation evaluated across several programming languages • Self-reported
Vibe-Eval
Multimodal understanding evaluated on challenging examples • Self-reported
License & Metadata
License
proprietary
Announcement Date
December 1, 2024
Last Updated
July 19, 2025
Similar Models
Gemini 1.5 Flash
MM
Best score: 0.8 (MMLU)
Released: May 2024
Price: $0.15/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Gemini 2.5 Pro Preview 06-05
MM
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemma 3n E4B
MM · 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Gemini 2.5 Flash
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $0.30/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score: 0.5 (GPQA)
Released: Feb 2025
Price: $0.07/1M tokens
Gemma 3n E4B Instructed LiteRT Preview
MM · 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.