
Gemini 1.5 Flash

Multimodal
Google

Gemini 1.5 Flash is a fast, versatile multimodal model built to scale across diverse tasks. It accepts audio, image, video, and text inputs and generates text outputs. The model is optimized for high-frequency, high-volume workloads such as code generation, data extraction, and text editing.
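
A minimal usage sketch with the Python google-generativeai SDK; the environment variable and the prompt below are illustrative assumptions, not part of this specification:

    # Minimal sketch: text generation with Gemini 1.5 Flash via the
    # google-generativeai Python SDK (pip install google-generativeai).
    # GOOGLE_API_KEY and the example prompt are illustrative assumptions.
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Text-only request; image, audio, and video parts can be included in a
    # content list for multimodal prompts.
    response = model.generate_content("Extract the total from: 'Total due: $42.10'")
    print(response.text)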

Key Specifications

Parameters
-
Context
1.0M
Release Date
May 1, 2024
Average Score
66.8%

Timeline

Key dates in the model's history
Announcement
May 1, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
November 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
1.0M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
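
Based on the list prices above ($0.15 per 1M input tokens, $0.60 per 1M output tokens), a rough per-request cost can be estimated as follows; the token counts in the example are illustrative:

    # Rough cost estimate from the list prices above (USD per 1M tokens).
    # Token counts are illustrative assumptions.
    INPUT_PRICE_PER_M = 0.15
    OUTPUT_PRICE_PER_M = 0.60

    def estimate_cost(input_tokens: int, output_tokens: int) -> float:
        """Return the estimated request cost in USD."""
        return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
               (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

    # Example: a 200k-token document summarized into a 2k-token answer.
    print(f"${estimate_cost(200_000, 2_000):.4f}")  # $0.0312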

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
Accuracy (10-shot), self-reported
86.5%
MMLU
Accuracy, self-reported
78.9%

Programming

Programming skills tests
HumanEval
Pass Rate, self-reported. The pass rate measures the proportion of tasks for which the model's generated solution passes the associated test cases. Success is typically judged by comparing the model's answer against the reference (an exact value or choice for many tasks; some require extracting the answer from a longer reasoning chain). It can be reported as an overall score or broken down by task category to highlight strengths and weaknesses (a minimal computation sketch follows below).
74.3%
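
As a sketch of how such a pass rate can be computed; the Task structure and checker functions below are hypothetical stand-ins for a real evaluation harness, not the actual HumanEval tooling:

    # Sketch: pass rate as the fraction of tasks whose generated solution
    # passes all of its test cases.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Task:
        """Hypothetical container for a generated solution and its checker."""
        solution: str                        # model-generated code
        passes_tests: Callable[[str], bool]  # True if all test cases pass

    def pass_rate(tasks: List[Task]) -> float:
        """Fraction of tasks whose generated solution passes its tests."""
        if not tasks:
            return 0.0
        return sum(1 for t in tasks if t.passes_tests(t.solution)) / len(tasks)

    # Illustrative usage with trivial checkers.
    tasks = [Task("def add(a, b): return a + b", lambda s: "return a + b" in s),
             Task("def sub(a, b): return a - a", lambda s: "return a - b" in s)]
    print(pass_rate(tasks))  # 0.5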

Mathematics

Mathematical problems and computations
GSM8k
Accuracy (11-shot), self-reported
86.2%
MATH
Accuracy, self-reported
77.9%
MGSM
Accuracy (8-shot), self-reported
82.6%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Accuracy (3-shot), self-reported
85.5%
GPQA
Accuracy, self-reported
51.0%

Multimodal

Working with images and visual data
MathVista
Accuracy, self-reported
65.8%
MMMU
Accuracy, self-reported
62.3%

Other Tests

Specialized benchmarks
AMC_2022_23
Accuracy (4-shot), self-reported
34.8%
FLEURS
Word Error Rate (WER), self-reported; lower is better. WER is the standard metric for automatic speech recognition: the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to transform the system output into the reference transcription, divided by the number of words in the reference (N), i.e. WER = (S + D + I) / N, with 0 meaning perfect recognition. WER treats all errors equally and ignores semantic impact and word order, so it is a coarse measure of transcription quality. (A computation sketch appears after this benchmark list.)
9.6%
FunctionalMATH
Accuracy (0-shot), self-reported
53.6%
HiddenMath
Accuracy, self-reported. For each task with a known exact answer, the model's response is checked against the reference: for numeric answers (e.g., AIME-style problems), the final answer must contain the correct number; for free-form questions, the response is counted correct if it matches the reference even when phrased differently (e.g., "it was in 1876" for a year question), and if the model offers several candidates, it is counted correct when the correct answer is among them; for multiple-choice questions, the selected option must be the correct one. (A matching-logic sketch appears after this benchmark list.)
47.2%
MMLU-Pro
Accuracy, self-reported
67.3%
MRCR
Accuracy, self-reported
71.9%
Natural2Code
Accuracy, self-reported
79.8%
PhysicsFinals
Accuracy (0-shot), self-reported
57.4%
Vibe-Eval
Accuracy, self-reported
48.9%
Video-MME
Accuracy, self-reported
76.1%
WMT23
Score, self-reported
74.1%
XSTest
Accuracy, self-reported
97.0%
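
The FLEURS entry above reports Word Error Rate. A minimal sketch of that computation, assuming simple whitespace tokenization and an illustrative transcript pair:

    # Word Error Rate: (S + D + I) / N, computed as a word-level edit
    # distance between hypothesis and reference transcripts.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Illustrative example: one substitution among four reference words.
    print(wer("the cat sat down", "the cat sat up"))  # 0.25

The HiddenMath entry describes exact-answer matching. The sketch below illustrates that kind of grading logic; the rules and helper names are simplified assumptions, not the actual grader:

    # Simplified answer-checking heuristics, for illustration only.
    import re

    def is_correct(model_answer: str, reference: str, choices=None) -> bool:
        answer = model_answer.strip().lower()
        ref = reference.strip().lower()
        if choices is not None:
            # Multiple choice: the selected option must be the correct one.
            return answer == ref
        if re.fullmatch(r"-?\d+(?:\.\d+)?", ref):
            # Numeric answer: the correct number must appear in the response.
            return ref in re.findall(r"-?\d+(?:\.\d+)?", answer)
        # Free-form answer: counted correct if the reference appears in the
        # response, even when phrased differently.
        return ref in answer

    # Illustrative usage.
    print(is_correct("The final answer is 42.", "42"))    # True
    print(is_correct("It was founded in 1876.", "1876"))  # True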

License & Metadata

License
proprietary
Announcement Date
May 1, 2024
Last Updated
July 19, 2025
