Gemini 1.5 Flash
Multimodal
Gemini 1.5 Flash is a fast and versatile multimodal model for scaling across diverse tasks. It accepts audio, image, video, and text inputs and generates text outputs. The model is optimized for code generation, data extraction, and text editing, making it well suited to specialized high-frequency tasks.
Key Specifications
Parameters
-
Context
1.0M
Release Date
May 1, 2024
Average Score
66.8%
Timeline
Key dates in the model's history
Announcement
May 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
November 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
1.0M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
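To make the listed rates concrete, here is a minimal back-of-the-envelope cost calculator; the function name and example token counts are illustrative, and actual billing may differ (e.g., long-context pricing tiers or cached-token discounts).

```python
# Cost estimate for a single request at the listed rates.
# Rates come from the pricing table above; everything else
# (function name, example token counts) is illustrative.

INPUT_USD_PER_M = 0.15   # $ per 1M input tokens
OUTPUT_USD_PER_M = 0.60  # $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in USD."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Example: a 100K-token prompt producing an 8K-token response.
print(f"${estimate_cost(100_000, 8_000):.4f}")  # -> $0.0198
```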
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
Accuracy (10-shot) • Self-reported
MMLU
Accuracy • Self-reported
Programming
Programming skills tests
HumanEval
Pass Rate measures the proportion of tasks the model solves successfully. An answer is counted as correct when it matches the reference (many tasks require an exact value or choice); some tasks require a longer reasoned answer. Scores can be reported as an aggregate over the whole dataset or broken down by task type to identify the model's strengths and weaknesses. • Self-reported
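The figure reported here is a plain pass rate; for background, a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval (Chen et al., 2021). This is general context under that assumption, not necessarily the exact protocol behind the self-reported number.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generations pass -> pass@1 = 0.3
print(pass_at_k(10, 3, 1))
```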
Mathematics
Mathematical problems and computations
GSM8k
Accuracy (11-shot) • Self-reported
MATH
Accuracy • Self-reported
MGSM
Accuracy (8-shot) • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
Accuracy (3-shot) • Self-reported
GPQA
Accuracy • Self-reported
Multimodal
Working with images and visual data
MathVista
Accuracy • Self-reported
MMMU
Accuracy • Self-reported
Other Tests
Specialized benchmarks
AMC_2022_23
Accuracy (4-shot) • Self-reported
FLEURS
Word Error Rate (WER) is the standard metric for evaluating automatic speech recognition (ASR) systems. It measures the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to transform the system's output into the reference transcription, divided by the number of words in the reference (N): WER = (S + D + I) / N. Lower values indicate better performance, with 0 being perfect recognition. WER has limitations, however: it treats all errors equally and does not account for semantic meaning (substituting "their" with "there" counts the same as substituting "apple" with "automobile", despite being far less disruptive to understanding), and it ignores word-order significance, which can be crucial for meaning. • Self-reported
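A minimal reference implementation of the WER formula above, using word-level Levenshtein distance; the function name is ours, and real ASR scoring pipelines typically apply text normalization (casing, punctuation) before comparison.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: one substitution in a four-word reference -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat up"))
```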
FunctionalMATH
Accuracy (0-shot) • Self-reported
HiddenMath
Accuracy. For each task with a known exact answer, we check whether the model's answer matches the reference. If the model offers several candidate answers instead of committing to one, the response counts as correct when the right answer is among them. In more detail: 1. For questions with numeric answers, such as AIME problems, we check whether the model's final answer contains the correct number. 2. For free-form questions, we check whether the model's answer matches the reference in substance even when the wording differs; for example, a response of "was in 1876" counts as correct when the reference is the year 1876. 3. For multiple-choice questions, we check whether the model selects the fully correct option. • Self-reported
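A rough sketch of the matching rules described above; every name and normalization choice here is an assumption for illustration, not the actual evaluation harness.

```python
# Illustrative only: function names and normalization rules are
# assumptions, not the real grading code.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return text.strip().lower().rstrip(".")

def is_correct(model_answer: str, reference: str) -> bool:
    """Rules 1-2: the response counts as correct if the reference
    answer (a number or short phrase) appears in it, so
    'was in 1876' matches a reference of '1876'. This also covers a
    model that hedges and lists several candidates."""
    return normalize(reference) in normalize(model_answer)

def mc_correct(model_answer: str, correct_option: str) -> bool:
    """Rule 3 (simplified): multiple choice requires naming the
    fully correct option, e.g. 'B'."""
    return normalize(model_answer) == normalize(correct_option)

print(is_correct("It was in 1876.", "1876"))  # True
```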
MMLU-Pro
Accuracy • Self-reported
MRCR
Accuracy • Self-reported
Natural2Code
Accuracy • Self-reported
PhysicsFinals
Accuracy (0-shot) • Self-reported
Vibe-Eval
Accuracy • Self-reported
Video-MME
Accuracy • Self-reported
WMT23
Score • Self-reported
XSTest
Accuracy • Self-reported
License & Metadata
License
proprietary
Announcement Date
May 1, 2024
Last Updated
July 19, 2025
Similar Models
Gemini 2.0 Flash Thinking
Best score: 0.7 (GPQA)
Released: Jan 2025
Gemini 2.0 Flash
Best score: 0.6 (GPQA)
Released: Dec 2024
Price: $0.10/1M tokens
Gemini 2.5 Pro Preview 06-05
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemini 2.0 Flash-Lite
Best score: 0.5 (GPQA)
Released: Feb 2025
Price: $0.07/1M tokens
Gemini 2.5 Flash-Lite
Best score: 0.6 (GPQA)
Released: Jun 2025
Price: $0.10/1M tokens
Gemini 3 Flash
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 3.1 Pro
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 1.5 Pro
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.