Gemini 2.0 Flash Thinking

Multimodal
Google

Gemini 2.0 Flash Thinking is an enhanced reasoning model that can show its thought process, improving both performance and explainability. Combining speed with strong reasoning, it excels at science and math tasks, showing its work when solving complex problems.
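
As a minimal sketch of how the model can be queried, here is a Python example using Google's google-genai SDK; the experimental model ID and the prompt are illustrative assumptions, not taken from this page.

# Minimal sketch: querying Gemini 2.0 Flash Thinking with the google-genai SDK.
# The model ID "gemini-2.0-flash-thinking-exp-01-21" is an assumption based on
# Google's experimental naming at the January 21, 2025 release.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # assumes a key from Google AI Studio

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp-01-21",
    contents="A train travels at 80 km/h. How long does it take to cover 200 km?",
)

# The thinking variant reasons through the problem before answering;
# response.text holds the final answer.
print(response.text)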

Key Specifications

Parameters
-
Context
-
Release Date
January 21, 2025
Average Score
74.3%

Timeline

Key dates in the model's history
Announcement
January 21, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
August 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA
Challenging science questions requiring chain-of-thought reasoning. AI systems have made tremendous strides in answering factual questions, but complex science problems that require multi-step reasoning and domain knowledge remain challenging. This task involves a set of science questions from various domains (physics, chemistry, biology, etc.) that require the model to:
1. Break down complex problems into logical steps
2. Apply scientific principles and formulas correctly
3. Reason through each step sequentially
4. Show calculations when necessary
5. Arrive at accurate conclusions
The questions test both factual knowledge and the ability to use that knowledge in a logical reasoning chain. For example, a physics problem might require calculating forces, then using those values to determine whether an object will move, and finally explaining the real-world implications. Success requires not just memorized facts but the ability to connect concepts across domains and apply them appropriately in novel scenarios, mirroring how human experts solve scientific problems.
Self-reported
74.2%
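
To make the scoring concrete, here is a hypothetical sketch of a chain-of-thought evaluation loop; the question fields and the ask_model helper are illustrative assumptions, not GPQA's official harness.

# Hypothetical benchmark loop: prompt for step-by-step reasoning, then score
# the final line. ask_model and the record fields are illustrative assumptions.
def evaluate(questions, ask_model):
    correct = 0
    for q in questions:
        prompt = (
            f"{q['question']}\n"
            f"Choices: {', '.join(q['choices'])}\n"
            "Think step by step, then give the final choice on the last line."
        )
        reply = ask_model(prompt)
        answer = reply.strip().splitlines()[-1]  # final line holds the choice
        if q['answer'] in answer:
            correct += 1
    return correct / len(questions)  # accuracy, e.g. 0.742 for the score above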

Multimodal

Working with images and visual data
MMMU
Multi-discipline questions spanning various fields (science, art, engineering, medicine, and more) that require understanding images, diagrams, and charts alongside text.
Self-reported
75.4%

Other Tests

Specialized benchmarks
AIME 2024
Competition problems from the American Invitational Mathematics Examination, testing multi-step mathematical reasoning.
Self-reported
73.3%

License & Metadata

License
proprietary
Announcement Date
January 21, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
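
To illustrate, a hypothetical similarity score over those four characteristics might look like the sketch below; the weights and record fields are assumptions, not the catalog's actual formula.

# Hypothetical similarity ranking over the listed characteristics;
# the weights and record fields are illustrative assumptions.
def similarity(a, b):
    score = 0.0
    score += 0.35 * (a["developer"] == b["developer"])
    score += 0.25 * (a["multimodal"] == b["multimodal"])
    if a.get("parameters") and b.get("parameters"):
        ratio = min(a["parameters"], b["parameters"]) / max(a["parameters"], b["parameters"])
        score += 0.15 * ratio  # closer parameter counts score higher
    score += 0.25 * (1.0 - abs(a["avg_score"] - b["avg_score"]))  # scores in [0, 1]
    return score

def recommend(model, catalog, k=3):
    others = [m for m in catalog if m["name"] != model["name"]]
    return sorted(others, key=lambda m: similarity(model, m), reverse=True)[:k]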