
Gemini 2.0 Flash-Lite

Multimodal
Google

Gemini 2.0 Flash model optimized for cost efficiency and low latency

Key Specifications

Parameters
-
Context
1.0M
Release Date
February 5, 2025
Average Score
59.0%

Timeline

Key dates in the model's history
Announcement
February 5, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.07
Output (per 1M tokens)
$0.30
Max Input Tokens
1.0M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
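
For orientation, here is a minimal sketch of calling this model and estimating per-request cost from the list prices above. It assumes the google-genai Python SDK, the "gemini-2.0-flash-lite" model id, an API key in the GEMINI_API_KEY environment variable, and simple linear pricing with no caching or batch discounts; the usage-metadata field names are assumptions for illustration, not an official example.

    # Minimal sketch: one generate_content call plus a cost estimate from the
    # list prices above. Model id, env var, and usage-metadata field names are
    # assumptions, not verified against a specific SDK version.
    import os
    from google import genai

    INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens (pricing table above)
    OUTPUT_PRICE_PER_M = 0.30  # USD per 1M output tokens

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash-lite",
        contents="Summarize the trade-offs of latency-optimized LLMs in three bullets.",
    )
    print(response.text)

    # Estimate the cost of this single request from the reported token counts.
    usage = response.usage_metadata
    cost = (usage.prompt_token_count / 1e6) * INPUT_PRICE_PER_M \
         + (usage.candidates_token_count / 1e6) * OUTPUT_PRICE_PER_M
    print(f"Estimated cost: ${cost:.6f}")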

Benchmark Results

Model performance metrics across various tests and benchmarks

Mathematics

Mathematical problems and computations
MATH
Competition-level mathematics problems. Self-reported.
86.8%

Reasoning

Logical reasoning and analysis
GPQA
Diamond subset: graduate-level, Google-proof multiple-choice questions in biology, physics, and chemistry. Self-reported.
51.5%

Multimodal

Working with images and visual data
MMMU
College-level multimodal understanding and reasoning across multiple disciplines. Self-reported.
68.0%

Other Tests

Specialized benchmarks
Bird-SQL (dev)
Text-to-SQL generation on the BIRD development set. Self-reported.
57.4%
CoVoST2
Automatic speech translation across 21 languages, scored with BLEU. Self-reported.
38.4%
EgoSchema
Long-form video question answering across multiple domains. Self-reported.
67.2%
FACTS Grounding
Factual grounding of responses in provided source documents. Self-reported.
83.6%
Global-MMLU-Lite
Multilingual MMLU subset, 0-shot evaluation. Self-reported.
78.2%
HiddenMath
Competition-level math problems from a held-out set crafted to avoid web leakage, 0-shot evaluation. Self-reported.
55.3%
LiveCodeBench v5
Code generation on recently published competitive-programming problems, scored as pass@1 (a single attempt per problem; see the pass@k sketch after this list). Self-reported.
28.9%
MMLU-Pro
A harder, more reasoning-intensive extension of MMLU with ten answer choices per question. Self-reported.
71.6%
MRCR 1M
Long-context understanding: retrieving and resolving references over inputs up to 1M tokens. Self-reported.
58.0%
SimpleQA
World-knowledge factual accuracy on short fact-seeking questions, with no access to search. Self-reported.
21.7%
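
The LiveCodeBench v5 score above is reported as pass@1: the fraction of problems solved by a single sampled solution. When n samples per problem are drawn, the standard unbiased estimator from the HumanEval paper generalizes this to pass@k; a small sketch follows (the sampling counts used for this model are not stated here):

    # Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    # one of k samples, drawn without replacement from n generated solutions of
    # which c are correct, passes the tests. For k = 1 this reduces to c / n.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:               # every size-k draw must contain a correct sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=10, c=3, k=1))  # 0.3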

License & Metadata

License
proprietary
Announcement Date
February 5, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
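
As a purely hypothetical illustration of how such similarity-based recommendations could be scored (the catalog's actual ranking logic is not documented here), each candidate model can be compared on the listed characteristics and the closest matches surfaced; all weights and candidate values below are placeholders.

    # Hypothetical similarity scoring for "Similar Models" recommendations.
    # Weights, fields, and candidate values are illustrative placeholders only.
    from dataclasses import dataclass

    @dataclass
    class ModelCard:
        name: str
        developer: str
        multimodal: bool
        params_b: float | None    # parameter count in billions, None if undisclosed
        avg_score: float          # benchmark average, 0-100

    def similarity(a: ModelCard, b: ModelCard) -> float:
        score = 0.0
        score += 0.3 if a.developer == b.developer else 0.0
        score += 0.2 if a.multimodal == b.multimodal else 0.0
        if a.params_b is not None and b.params_b is not None:
            score += 0.2 * (1 - abs(a.params_b - b.params_b) / max(a.params_b, b.params_b))
        score += 0.3 * (1 - abs(a.avg_score - b.avg_score) / 100)
        return score  # higher means more similar

    flash_lite = ModelCard("Gemini 2.0 Flash-Lite", "Google", True, None, 59.0)
    candidate = ModelCard("(hypothetical candidate)", "Google", True, None, 66.0)
    print(round(similarity(flash_lite, candidate), 3))  # 0.779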