
Gemini 1.5 Flash

Multimodal
Google

Gemini 1.5 Flash is a fast, versatile multimodal model built to scale across diverse tasks. It accepts audio, image, video, and text inputs and generates text outputs. The model is optimized for high-frequency, high-volume workloads such as code generation, data extraction, and text editing.
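
A minimal usage sketch with the Python google-generativeai SDK; the environment variable and the prompt below are illustrative assumptions, not part of this specification:

    # Minimal sketch: text generation with Gemini 1.5 Flash via the
    # google-generativeai Python SDK (pip install google-generativeai).
    # GOOGLE_API_KEY and the example prompt are illustrative assumptions.
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Text-only request; image, audio, and video parts can be included in a
    # content list for multimodal prompts.
    response = model.generate_content("Extract the total from: 'Total due: $42.10'")
    print(response.text)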

Key Specifications

Parameters
-
Context
1.0M
Release Date
May 1, 2024
Average Score
66.8%

Timeline

Key dates in the model's history
Announcement
May 1, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
November 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
1.0M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
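
Based on the list prices above ($0.15 per 1M input tokens, $0.60 per 1M output tokens), a rough per-request cost can be estimated as follows; the token counts in the example are illustrative:

    # Rough cost estimate from the list prices above (USD per 1M tokens).
    # Token counts are illustrative assumptions.
    INPUT_PRICE_PER_M = 0.15
    OUTPUT_PRICE_PER_M = 0.60

    def estimate_cost(input_tokens: int, output_tokens: int) -> float:
        """Return the estimated request cost in USD."""
        return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
               (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

    # Example: a 200k-token document summarized into a 2k-token answer.
    print(f"${estimate_cost(200_000, 2_000):.4f}")  # $0.0312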

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
Accuracy (10-shot), self-reported
86.5%
MMLU
Accuracy, self-reported
78.9%

Programming

Programming skills tests
HumanEval
Pass Rate, self-reported. The pass rate measures the proportion of tasks for which the model's generated solution passes the associated test cases. Success is typically judged by comparing the model's answer against the reference (an exact value or choice for many tasks; some require extracting the answer from a longer reasoning chain). It can be reported as an overall score or broken down by task category to highlight strengths and weaknesses (a minimal computation sketch follows below).
74.3%
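
As a sketch of how such a pass rate can be computed; the Task structure and checker functions below are hypothetical stand-ins for a real evaluation harness, not the actual HumanEval tooling:

    # Sketch: pass rate as the fraction of tasks whose generated solution
    # passes all of its test cases.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Task:
        """Hypothetical container for a generated solution and its checker."""
        solution: str                        # model-generated code
        passes_tests: Callable[[str], bool]  # True if all test cases pass

    def pass_rate(tasks: List[Task]) -> float:
        """Fraction of tasks whose generated solution passes its tests."""
        if not tasks:
            return 0.0
        return sum(1 for t in tasks if t.passes_tests(t.solution)) / len(tasks)

    # Illustrative usage with trivial checkers.
    tasks = [Task("def add(a, b): return a + b", lambda s: "return a + b" in s),
             Task("def sub(a, b): return a - a", lambda s: "return a - b" in s)]
    print(pass_rate(tasks))  # 0.5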

Mathematics

Mathematical problems and computations
GSM8k
Accuracy (11-shot), self-reported
86.2%
MATH
Accuracy, self-reported
77.9%
MGSM
Accuracy (8-shot), self-reported
82.6%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Accuracy (3-shot), self-reported
85.5%
GPQA
Accuracy, self-reported
51.0%

Multimodal

Working with images and visual data
MathVista
Accuracy, self-reported
65.8%
MMMU
Accuracy, self-reported
62.3%

Other Tests

Specialized benchmarks
AMC_2022_23
Accuracy (4-shot), self-reported
34.8%
FLEURS
Word Error Rate (WER), self-reported; lower is better. WER is the standard metric for automatic speech recognition: the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to transform the system output into the reference transcription, divided by the number of words in the reference (N), i.e. WER = (S + D + I) / N, with 0 meaning perfect recognition. WER treats all errors equally and ignores semantic impact and word order, so it is a coarse measure of transcription quality. (A computation sketch appears after this benchmark list.)
9.6%
FunctionalMATH
Accuracy (0-shot), self-reported
53.6%
HiddenMath
Accuracy, self-reported. For each task with a known exact answer, the model's response is checked against the reference: for numeric answers (e.g., AIME-style problems), the final answer must contain the correct number; for free-form questions, the response is counted correct if it matches the reference even when phrased differently (e.g., "it was in 1876" for a year question), and if the model offers several candidates, it is counted correct when the correct answer is among them; for multiple-choice questions, the selected option must be the correct one. (A matching-logic sketch appears after this benchmark list.)
47.2%
MMLU-Pro
Accuracy, self-reported
67.3%
MRCR
Accuracy, self-reported
71.9%
Natural2Code
Accuracy, self-reported
79.8%
PhysicsFinals
Accuracy (0-shot), self-reported
57.4%
Vibe-Eval
Accuracy, self-reported
48.9%
Video-MME
Accuracy, self-reported
76.1%
WMT23
Score, self-reported
74.1%
XSTest
Accuracy, self-reported
97.0%
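
The FLEURS entry above reports Word Error Rate. A minimal sketch of that computation, assuming simple whitespace tokenization and an illustrative transcript pair:

    # Word Error Rate: (S + D + I) / N, computed as a word-level edit
    # distance between hypothesis and reference transcripts.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Illustrative example: one substitution among four reference words.
    print(wer("the cat sat down", "the cat sat up"))  # 0.25

The HiddenMath entry describes exact-answer matching. The sketch below illustrates that kind of grading logic; the rules and helper names are simplified assumptions, not the actual grader:

    # Simplified answer-checking heuristics, for illustration only.
    import re

    def is_correct(model_answer: str, reference: str, choices=None) -> bool:
        answer = model_answer.strip().lower()
        ref = reference.strip().lower()
        if choices is not None:
            # Multiple choice: the selected option must be the correct one.
            return answer == ref
        if re.fullmatch(r"-?\d+(?:\.\d+)?", ref):
            # Numeric answer: the correct number must appear in the response.
            return ref in re.findall(r"-?\d+(?:\.\d+)?", answer)
        # Free-form answer: counted correct if the reference appears in the
        # response, even when phrased differently.
        return ref in answer

    # Illustrative usage.
    print(is_correct("The final answer is 42.", "42"))    # True
    print(is_correct("It was founded in 1876.", "1876"))  # True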

License & Metadata

License
proprietary
Announcement Date
May 1, 2024
Last Updated
July 19, 2025
