Gemini 2.5 Flash-Lite
Multimodal
Gemini 2.5 Flash-Lite is a model developed by Google DeepMind, designed for a variety of tasks including reasoning, science, math, code generation, and more. It features advanced multilingual performance and long-context understanding. The model is optimized for low-latency use cases and supports multimodal input with a 1 million token context window.
Key Specifications
Parameters
-
Context
1.0M
Release Date
June 17, 2025
Average Score
40.8%
Timeline
Key dates in the model's history
Announcement
June 17, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
January 1, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
65.5K
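For orientation, here is a minimal sketch (in Python, with hypothetical token counts) of what a single request costs at the listed rates of $0.10 per 1M input tokens and $0.40 per 1M output tokens:

```python
# Rough per-request cost estimate at the listed Gemini 2.5 Flash-Lite rates.
INPUT_PRICE_PER_M = 0.10   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 200k-token input with a 4k-token response.
print(f"${request_cost(200_000, 4_000):.4f}")  # -> $0.0216
```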
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
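As an illustration of how features such as function calling are typically exercised, the sketch below uses the google-genai Python SDK with the model ID "gemini-2.5-flash-lite"; the SDK usage, model ID, and the get_weather tool are assumptions to verify against the current Gemini API documentation:

```python
# Minimal function-calling sketch; SDK usage and model ID are assumptions.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Hypothetical tool the model may choose to call.
get_weather = types.FunctionDeclaration(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"city": types.Schema(type=types.Type.STRING)},
        required=["city"],
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="What is the weather in Paris right now?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[get_weather])],
    ),
)

# If the model requests a tool call, it arrives as a structured part.
part = response.candidates[0].content.parts[0]
if part.function_call:
    print(part.function_call.name, dict(part.function_call.args))
```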
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
Arc
There is currently no method that fully explains how a large language model (LLM) solves tasks in thinking mode; comparing intermediate reasoning traces alone does not reveal the key steps of the solution process, and without an understanding of how the model solves a task it is hard to devise strategies for improving performance. To address this, an interactive exploration method is used in which the model's reasoning is examined while it solves tasks. This includes checking for errors during the reasoning process, applying verification steps, and inspecting the model's thinking. The methodology targets three key aspects of performance: knowledge (what the model actually knows), reasoning (the ability to apply that knowledge to solve tasks), and thinking (how efficiently the context is used to apply knowledge correctly). The approach surfaces both general patterns and errors that are missed by answer-only analysis, and helps guide targeted improvements. • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
In this benchmark the model's ability to write code that solves a given task is evaluated. The model is asked a question and must produce code that solves the task; the code is then executed and checked against data. Unlike other coding benchmarks, the model is not required to obtain the correct result on the first attempt: it may include its own verification logic and check its results before submitting them. The setup gives the model access to three stages: (1) writing code to solve the problem, (2) checking the code for correctness and errors, and (3) producing the final solution. The model must explicitly distinguish between these different aspects of its thinking. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond is a verification procedure for improving the accuracy of language-model answers: it evaluates several candidate solutions in order to identify the correct one. How Diamond works: (1) generation: produce a set of independent solutions to the task; (2) comparison: evaluate the solutions against each other to determine which of them is correct; (3) selection: choose the most reliable solution on that basis. Diamond can be used with a single model or with several models, which allows it to improve the performance of larger systems. It is especially effective for tasks that require step-by-step reasoning, such as mathematical problems, because it allows the model to correct errors in its own reasoning. • Self-reported
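A minimal sketch of the multi-solution selection idea described above, assuming a hypothetical ask_model function that returns one sampled answer per call (the real procedure may compare full solutions rather than only final answers):

```python
# Illustrative majority-vote selection over several independent samples.
from collections import Counter

def majority_answer(question: str, ask_model, n_samples: int = 5) -> str:
    """Sample n_samples answers and return the most common one."""
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```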
Multimodal
Working with images and visual data
MMMU
The model is first asked to note all relevant details (for example, in the image or mathematical content) together with any context, and then to answer the question by: (1) identifying the type of task (for example, solving an equation, verifying a proof, explaining a concept), (2) breaking the task into logical steps, (3) explaining the reasoning at each step, (4) showing all intermediate computations and their results, and (5) stating the final answer clearly. If the answer involves mathematical expressions, they should be written out exactly, with all steps shown. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
AI models increasingly help debug code, make improvements, or implement features from natural language specifications. Code editing evaluates the ability to transform a given piece of code according to specific requirements. Basic aspects of code editing include: debugging (fixing syntax or logical errors), refactoring (improving code structure without changing functionality), implementing features (adding new functionality according to specifications), and code transformation (converting code between languages or frameworks). Advanced aspects include handling complex codebases with multiple files and dependencies, understanding broader architectural implications, and making changes that respect existing patterns and standards. Evaluation methods include functional correctness (does the edited code perform as specified?), test passing rate (does the edited code pass all test cases?), code quality (is the edited code efficient, maintainable, and following best practices?), and minimal modifications (does the model make only the necessary changes?). Typical tasks provide code along with a description of the desired changes; the model must understand both the code's current structure and the requirements for modification. • Self-reported
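One common way to operationalize the "test passing rate" criterion above is to run the project's test suite against the edited code; the sketch below assumes a pytest-based project and treats the exit code as pass/fail (paths and tooling are illustrative):

```python
# Run the test suite in a subprocess and report pass/fail for an edit.
import subprocess

def edit_passes_tests(repo_dir: str) -> bool:
    """Return True if the tests in repo_dir pass after the edit."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```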
AIME 2025
Standard evaluation • Self-reported
FACTS Grounding
Factuality is definitely a key aspect I consider when evaluating my responses. I check my facts carefully to ensure I'm providing accurate information. When I'm unsure about something, I try to be transparent about that uncertainty rather than presenting speculation as fact. I also avoid making definitive claims on topics where there's significant debate or where the facts are still evolving. One strategy I use is carefully distinguishing between well-established facts, expert consensus, emerging research, and speculative ideas. I'm especially careful with sensitive topics like health information, scientific claims, historical events, and statistical data. If I realize I've made a factual error, I acknowledge it directly and provide the correct information. I believe maintaining factual accuracy is essential for being helpful and trustworthy. • Self-reported
Global-MMLU-Lite
Multilingual performance, evaluated on questions translated into other languages, including Russian. • Self-reported
Humanity's Last Exam
This evaluation checks how well the model can find errors in solutions, as opposed to merely recognizing correct ones; that ability indicates a deeper understanding of the domain. The method differs from standard tests of whether models can solve mathematical tasks: (1) the model is given a solution, which may be correct or incorrect; (2) the model must judge whether the solution is correct; (3) if the solution is incorrect, the model must locate the error; (4) if the solution is correct, the model must confirm this. This setup better matches real use of AI models, where users submit their own solutions and ask for feedback. Tasks are drawn from the MATH benchmark, and the model is given both correct solutions from the dataset and incorrect ones. The evaluation also probes whether the model understands a solution well enough to find errors in it, whether it can detect errors of varying complexity, and whether it avoids simply deferring to presented solutions, especially when they look plausible but contain errors. • Self-reported
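A minimal scoring sketch for the verification setup described above, assuming a hypothetical judge_solution function that returns True when the model judges a candidate solution to be correct:

```python
# Score the model's correct/incorrect verdicts against ground-truth labels.
def verification_accuracy(items, judge_solution) -> float:
    """items: list of (problem, candidate_solution, is_correct) tuples."""
    hits = sum(
        judge_solution(problem, solution) == is_correct
        for problem, solution, is_correct in items
    )
    return hits / len(items)
```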
LiveCodeBench
Task: given a piece of code and its errors, produce a working solution. Method: (1) analysis: examine the code and the reported problem; (2) localization: identify exactly where the error is and why it occurs; (3) solution: write corrected code that fixes the problems, follows the conventions of the programming language, and keeps the rest of the code intact; (4) verification: check that the corrected code will work under the stated conditions. Limitations: edit only the necessary part, preserving the overall structure of the code; keep the original programming language; if additional context is required, say so explicitly. • Self-reported
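The analyze-and-fix loop described above can be pictured with a small sketch; propose_fix is a hypothetical step (for example, another model call) that takes the failing code plus its traceback and returns a patched version:

```python
# Run a candidate snippet; if it raises, hand the traceback to a fix step.
import traceback

def debug_once(code: str, propose_fix) -> str:
    """Execute code; return a proposed fix if it fails, else the original."""
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        return code
    except Exception:
        return propose_fix(code, traceback.format_exc())
```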
MRCR v2
Long context 128k average. 8 • Self-reported
SimpleQA
Despite their capabilities, LLMs suffer from hallucinations: they sometimes produce statements that sound plausible but are actually false. This is usually evaluated by checking the model's answers to factual questions. Existing benchmarks, however, often have limitations: answers may appear in the training data, questions may be solvable with the help of search, or they may test narrow specialist knowledge. Evaluating models on general-knowledge questions that a person would be expected to answer can give a clearer picture of the model's ability to answer factual questions. • Self-reported
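As a simplified illustration of grading short factual answers, the sketch below uses normalized exact match; this is an assumption for clarity, since factual-QA benchmarks of this kind often use a model-based grader instead:

```python
# Normalized exact-match grading for short factual answers.
def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def accuracy(pairs) -> float:
    """pairs: list of (prediction, gold) answer strings."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```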
Vibe-Eval
A multimodal evaluation benchmark developed by Reka. • Self-reported
License & Metadata
License
Creative Commons Attribution 4.0 License
Announcement Date
June 17, 2025
Last Updated
July 19, 2025
Similar Models
Gemini 2.5 Flash
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $0.30/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score: 0.5 (GPQA)
Released: Feb 2025
Price: $0.07/1M tokens
Gemini 3 Pro
MM
Best score: 0.9 (GPQA)
Released: Nov 2025
Price: $2.00/1M tokens
Gemini 3 Flash
MM
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 3.1 Pro
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Gemini 1.5 Pro
MM
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.