Gemini 2.5 Flash-Lite
Multimodal
Gemini 2.5 Flash-Lite is a model developed by Google DeepMind, designed for a variety of tasks including reasoning, science, math, code generation, and more. It features advanced multilingual performance and long-context understanding. The model is optimized for low-latency use cases and supports multimodal input with a 1 million token context window.
Key Specifications
Parameters
-
Context
1.0M
Release Date
June 17, 2025
Average Score
40.8%
Timeline
Key dates in the model's history
Announcement
June 17, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
January 1, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.40
Max Input Tokens
1.0M
Max Output Tokens
65.5K
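For orientation, here is a minimal sketch (in Python, with hypothetical token counts) of what a single request costs at the listed rates of $0.10 per 1M input tokens and $0.40 per 1M output tokens:

```python
# Rough per-request cost estimate at the listed Gemini 2.5 Flash-Lite rates.
INPUT_PRICE_PER_M = 0.10   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 200k-token input with a 4k-token response.
print(f"${request_cost(200_000, 4_000):.4f}")  # -> $0.0216
```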
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
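As an illustration of how features such as function calling are typically exercised, the sketch below uses the google-genai Python SDK with the model ID "gemini-2.5-flash-lite"; the SDK usage, model ID, and the get_weather tool are assumptions to verify against the current Gemini API documentation:

```python
# Minimal function-calling sketch; SDK usage and model ID are assumptions.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Hypothetical tool the model may choose to call.
get_weather = types.FunctionDeclaration(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"city": types.Schema(type=types.Type.STRING)},
        required=["city"],
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="What is the weather in Paris right now?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[get_weather])],
    ),
)

# If the model requests a tool call, it arrives as a structured part.
part = response.candidates[0].content.parts[0]
if part.function_call:
    print(part.function_call.name, dict(part.function_call.args))
```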
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
Arc
There is currently no method that fully explains how a large language model (LLM) solves tasks in thinking mode; comparing intermediate reasoning traces alone does not reveal the key steps of the solution process, and without an understanding of how the model solves a task it is hard to devise strategies for improving performance. To address this, an interactive exploration method is used in which the model's reasoning is examined while it solves tasks. This includes checking for errors during the reasoning process, applying verification steps, and inspecting the model's thinking. The methodology targets three key aspects of performance: knowledge (what the model actually knows), reasoning (the ability to apply that knowledge to solve tasks), and thinking (how efficiently the context is used to apply knowledge correctly). The approach surfaces both general patterns and errors that are missed by answer-only analysis, and helps guide targeted improvements. • Self-reported
Programming
Programming skills tests
SWE-Bench Verified
In this benchmark the model's ability to write code that solves a given task is evaluated. The model is asked a question and must produce code that solves the task; the code is then executed and checked against data. Unlike other coding benchmarks, the model is not required to obtain the correct result on the first attempt: it may include its own verification logic and check its results before submitting them. The setup gives the model access to three stages: (1) writing code to solve the problem, (2) checking the code for correctness and errors, and (3) producing the final solution. The model must explicitly distinguish between these different aspects of its thinking. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond is a verification procedure for improving the accuracy of language-model answers: it evaluates several candidate solutions in order to identify the correct one. How Diamond works: (1) generation: produce a set of independent solutions to the task; (2) comparison: evaluate the solutions against each other to determine which of them is correct; (3) selection: choose the most reliable solution on that basis. Diamond can be used with a single model or with several models, which allows it to improve the performance of larger systems. It is especially effective for tasks that require step-by-step reasoning, such as mathematical problems, because it allows the model to correct errors in its own reasoning. • Self-reported
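A minimal sketch of the multi-solution selection idea described above, assuming a hypothetical ask_model function that returns one sampled answer per call (the real procedure may compare full solutions rather than only final answers):

```python
# Illustrative majority-vote selection over several independent samples.
from collections import Counter

def majority_answer(question: str, ask_model, n_samples: int = 5) -> str:
    """Sample n_samples answers and return the most common one."""
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```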
Multimodal
Working with images and visual data
MMMU
The model is first asked to note all relevant details (for example, in the image or mathematical content) together with any context, and then to answer the question by: (1) identifying the type of task (for example, solving an equation, verifying a proof, explaining a concept), (2) breaking the task into logical steps, (3) explaining the reasoning at each step, (4) showing all intermediate computations and their results, and (5) stating the final answer clearly. If the answer involves mathematical expressions, they should be written out exactly, with all steps shown. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
AI models increasingly help debug code, make improvements, or implement features from natural language specifications. Code editing evaluates the ability to transform a given piece of code according to specific requirements. Basic aspects of code editing include: debugging (fixing syntax or logical errors), refactoring (improving code structure without changing functionality), implementing features (adding new functionality according to specifications), and code transformation (converting code between languages or frameworks). Advanced aspects include handling complex codebases with multiple files and dependencies, understanding broader architectural implications, and making changes that respect existing patterns and standards. Evaluation methods include functional correctness (does the edited code perform as specified?), test passing rate (does the edited code pass all test cases?), code quality (is the edited code efficient, maintainable, and following best practices?), and minimal modifications (does the model make only the necessary changes?). Typical tasks provide code along with a description of the desired changes; the model must understand both the code's current structure and the requirements for modification. • Self-reported
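One common way to operationalize the "test passing rate" criterion above is to run the project's test suite against the edited code; the sketch below assumes a pytest-based project and treats the exit code as pass/fail (paths and tooling are illustrative):

```python
# Run the test suite in a subprocess and report pass/fail for an edit.
import subprocess

def edit_passes_tests(repo_dir: str) -> bool:
    """Return True if the tests in repo_dir pass after the edit."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```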
AIME 2025
Standard evaluation • Self-reported
FACTS Grounding
Factuality is definitely a key aspect I consider when evaluating my responses. I check my facts carefully to ensure I'm providing accurate information. When I'm unsure about something, I try to be transparent about that uncertainty rather than presenting speculation as fact. I also avoid making definitive claims on topics where there's significant debate or where the facts are still evolving. One strategy I use is carefully distinguishing between well-established facts, expert consensus, emerging research, and speculative ideas. I'm especially careful with sensitive topics like health information, scientific claims, historical events, and statistical data. If I realize I've made a factual error, I acknowledge it directly and provide the correct information. I believe maintaining factual accuracy is essential for being helpful and trustworthy. • Self-reported
Global-MMLU-Lite
Multilingual performance, evaluated on questions translated into other languages, including Russian. • Self-reported
Humanity's Last Exam
This evaluation checks how well the model can find errors in solutions, as opposed to merely recognizing correct ones; that ability indicates a deeper understanding of the domain. The method differs from standard tests of whether models can solve mathematical tasks: (1) the model is given a solution, which may be correct or incorrect; (2) the model must judge whether the solution is correct; (3) if the solution is incorrect, the model must locate the error; (4) if the solution is correct, the model must confirm this. This setup better matches real use of AI models, where users submit their own solutions and ask for feedback. Tasks are drawn from the MATH benchmark, and the model is given both correct solutions from the dataset and incorrect ones. The evaluation also probes whether the model understands a solution well enough to find errors in it, whether it can detect errors of varying complexity, and whether it avoids simply deferring to presented solutions, especially when they look plausible but contain errors. • Self-reported
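A minimal scoring sketch for the verification setup described above, assuming a hypothetical judge_solution function that returns True when the model judges a candidate solution to be correct:

```python
# Score the model's correct/incorrect verdicts against ground-truth labels.
def verification_accuracy(items, judge_solution) -> float:
    """items: list of (problem, candidate_solution, is_correct) tuples."""
    hits = sum(
        judge_solution(problem, solution) == is_correct
        for problem, solution, is_correct in items
    )
    return hits / len(items)
```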
LiveCodeBench
Task: given a piece of code and its errors, produce a working solution. Method: (1) analysis: examine the code and the reported problem; (2) localization: identify exactly where the error is and why it occurs; (3) solution: write corrected code that fixes the problems, follows the conventions of the programming language, and keeps the rest of the code intact; (4) verification: check that the corrected code will work under the stated conditions. Limitations: edit only the necessary part, preserving the overall structure of the code; keep the original programming language; if additional context is required, say so explicitly. • Self-reported
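The analyze-and-fix loop described above can be pictured with a small sketch; propose_fix is a hypothetical step (for example, another model call) that takes the failing code plus its traceback and returns a patched version:

```python
# Run a candidate snippet; if it raises, hand the traceback to a fix step.
import traceback

def debug_once(code: str, propose_fix) -> str:
    """Execute code; return a proposed fix if it fails, else the original."""
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        return code
    except Exception:
        return propose_fix(code, traceback.format_exc())
```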
MRCR v2
Long context 128k average. 8 • Self-reported
SimpleQA
Despite their capabilities, LLMs suffer from hallucinations: they sometimes produce statements that sound plausible but are actually false. This is usually evaluated by checking the model's answers to factual questions. Existing benchmarks, however, often have limitations: answers may appear in the training data, questions may be solvable with the help of search, or they may test narrow specialist knowledge. Evaluating models on general-knowledge questions that a person would be expected to answer can give a clearer picture of the model's ability to answer factual questions. • Self-reported
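As a simplified illustration of grading short factual answers, the sketch below uses normalized exact match; this is an assumption for clarity, since factual-QA benchmarks of this kind often use a model-based grader instead:

```python
# Normalized exact-match grading for short factual answers.
def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def accuracy(pairs) -> float:
    """pairs: list of (prediction, gold) answer strings."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```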
Vibe-Eval
A multimodal evaluation benchmark developed by Reka. • Self-reported
License & Metadata
License
Creative Commons Attribution 4.0 License
Announcement Date
June 17, 2025
Last Updated
July 19, 2025
Similar Models
Gemini 2.5 Flash
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $0.30/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score: 0.5 (GPQA)
Released: Feb 2025
Price: $0.07/1M tokens
Gemini 3 Pro
MM
Best score: 0.9 (GPQA)
Released: Nov 2025
Price: $2.00/1M tokens
Gemini 3 Flash
MM
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 3.1 Pro
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Gemini 1.5 Pro
MM
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.