Key Specifications
Parameters
-
Context
2.1M
Release Date
May 1, 2024
Average Score
72.6%
Timeline
Key dates in the model's history
Announcement
May 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
November 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.50
Output (per 1M tokens)
$10.00
Max Input Tokens
2.1M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot — the model receives 10 worked examples in the prompt before the actual task. Providing 10 input-output pairs ahead of the question helps the model infer the expected answer format and approach. This is especially effective for complex tasks, since it lets the model pick up recurring patterns in the answers, the required format, the expected level of detail, and task-specific conventions. Compared with few-shot settings using fewer examples (e.g. 1-shot or 5-shot), 10-shot usually yields better performance, at the cost of a longer prompt. When using this method it is important to choose diverse, representative examples that cover different aspects of the task. • Self-reported
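As a sketch of how such a k-shot prompt can be assembled (the helper and the Q/A format here are illustrative, not the benchmark's actual harness):

```python
def build_few_shot_prompt(examples, question, k=10):
    """Assemble a k-shot prompt: k worked examples, then the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    parts.append(f"Q: {question}\nA:")  # the model completes this final answer
    return "\n\n".join(parts)

# Toy demonstration with 10 arithmetic examples.
shots = [(f"{i} + {i}", str(2 * i)) for i in range(10)]
prompt = build_few_shot_prompt(shots, "7 + 5", k=10)
print(prompt.count("Q:"))  # 11 blocks: 10 examples plus the target question
```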
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
The "0-shot" method refers to a model's ability to perform a task without any examples or task-specific training. The model relies exclusively on knowledge acquired during pre-training in order to answer. In 0-shot testing, the model is given the task with no additional instructions, hints, or examples of solutions to similar problems; it must generate the answer directly, using only the information in the question and its own background knowledge. For example, a 0-shot query would simply look like: "Solve the equation: 2x + 5 = 13." The model must provide the solution without any further prompting. 0-shot evaluation is the strictest test of a model's abilities, since it gives the model no hints or help beyond the question itself. • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
11-shot — the prompt contains 11 worked examples before the target problem. The description also alludes to iterative refinement: the model revisits the task over several passes, reconsidering the problem from a different angle or correcting its own errors, and improves its result each time. This is a form of chain-of-thought prompting in which the model generates intermediate reasoning steps that are then used to reach or verify the final answer; it is especially useful for complex reasoning, mathematical tasks, and other problems that require a multi-step solution process. • Self-reported
MATH
Accuracy
AI's accuracy in providing correct answers to queries is central to its utility and trustworthiness. This can be assessed by evaluating responses against ground truth answers across diverse question types.
Benchmarks: Performance on standardized tests (e.g., MMLU, GPQA, FrontierMath, Competition Math) provides quantitative accuracy metrics.
Human evaluation: Human experts can verify factual correctness, especially for nuanced questions where automated evaluation is challenging.
Consistency: Evaluating whether the AI provides the same answer to the same question across multiple attempts reveals the reliability of its reasoning.
Error analysis: Categorizing error types (e.g., factual errors, reasoning failures, hallucinations) helps identify specific weaknesses.
Domain-specific testing: Assessing performance in specialized knowledge domains (e.g., medicine, law, science) reveals the breadth and limitations of the AI's knowledge. • Self-reported
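The accuracy and consistency checks described above can be sketched as follows (a minimal illustration; the function names are ours, not part of any benchmark harness):

```python
from collections import Counter

def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(ground_truth)
    return sum(p == g for p, g in zip(predictions, ground_truth)) / len(ground_truth)

def consistency(attempts):
    """Share of repeated attempts agreeing with the most frequent answer."""
    _, count = Counter(attempts).most_common(1)[0]
    return count / len(attempts)

print(accuracy(["4", "9", "16"], ["4", "9", "15"]))  # 2 of 3 answers correct
print(consistency(["42", "42", "41", "42"]))         # 3 of 4 attempts agree
```

Real harnesses additionally normalize answers (case, whitespace, equivalent numeric forms) before comparison, which exact-match skips here.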
MGSM
8-shot • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
3-shot • Self-reported
DROP
Few-shot scaling — to probe the model's in-context (few-shot) learning ability, performance is measured while varying the number of examples in the prompt, typically from zero up to several.
Methodology: 1. Prompting: the same tasks are presented with a varying number of in-context examples (from 0 to n). 2. Scoring: model accuracy is recorded for each example count. 3. Analysis: we examine how performance changes as the number of examples grows.
This reveals how quickly the model learns from additional examples, whether extra examples significantly improve performance, and how well it performs with no examples at all (zero-shot). Comparing fixed versus randomly selected examples also shows how the choice of specific examples, and their placement in different parts of the context, affects performance. • Self-reported
GPQA
Accuracy
AI • Self-reported
Multimodal
Working with images and visual data
MathVista
Accuracy
AI models make factual errors. We measured factual accuracy using tasks on scientific, medical, and mathematical knowledge.
For GPQA, MMLU, Hellaswag, Winogrande, and general factual knowledge, we observed better accuracy with larger models, but both Claude 3 Opus and Llama 3 fell significantly behind GPT-4's accuracy levels.
In scientific knowledge, we see significant errors across all models, with Llama 3 and Claude 3 Opus providing similarly accurate responses, while GPT-4 showed the highest accuracy.
For medical knowledge, Claude 3 Opus demonstrated strong capabilities, with accuracy approaching GPT-4 in many cases, while Llama 3 demonstrated weaker performance, especially on more complex medical reasoning tasks.
In mathematical tasks, we noticed all models struggle with complex calculations and proofs, with common errors including:
- Computational mistakes
- Incorrect application of formulas
- Failure to correctly set up equations
- Making logical errors in proofs
Overall, larger models generally demonstrate better factual accuracy, but all models continue to make significant factual errors, especially in specialized domains requiring precise knowledge. • Self-reported
MMMU
Accuracy — AI: the model sometimes errs in computations, even when executing simple steps, and may fail to apply the correct method to solve a problem. This leads to incorrect answers, especially on complex mathematical or logical tasks requiring multi-step computation. Human: people can also make errors in complex calculations, but they usually handle basic mathematics reliably, know when they need to verify their work, and typically recognize task types and the corresponding solution methods. • Self-reported
Other Tests
Specialized benchmarks
AMC_2022_23
4-shot • Self-reported
FLEURS
errors in AI: We're measuring word error rate (WER), which is the percentage of words in the output that don't match the expected result. This helps us understand how accurately the model follows formatting or exact word choices in tasks requiring precision. Specifically, we compute the minimum number of edits (insertions, deletions, or substitutions) needed to transform the model's output into the reference text, divided by the number of words in the reference. For example, if the reference is "The quick brown fox jumps over the lazy dog" and the model outputs "A quick brown fox jumped over a lazy dog", the WER would be 3/9 ≈ 33.3%, since three words differ • Self-reported
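The WER computation described above can be sketched with a word-level edit distance (a minimal implementation, not the evaluation harness actually used):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits to turn hyp[:j] into ref[:i].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate(
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox jumped over a lazy dog",
)
print(round(wer, 3))  # → 0.333 (3 substitutions over 9 reference words)
```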
FunctionalMATH
Models can exploit superficial patterns in specific question types, simply guessing an answer position or recalling answers seen in training data rather than actually solving the problem; such a model can appear more capable than it is. We designed tests to detect whether a model relies on these shortcuts. Each test consists of a version where a simple heuristic would give the correct answer (for example, always choosing the first option in multiple choice) and a matched version where it would not: one variant with the correct answer in one position (e.g. answer A) and another with the correct answer moved (e.g. answer C). If the model uses a heuristic such as "always pick the first option" or "always answer True", its performance will be high on one version but drop significantly on the other. We ran these tests on a variety of mathematical tasks, including multiple-choice, True/False, and numeric-answer questions. For example, if the correct multiple-choice answer was "A", we reordered the options so the correct answer became "C"; for True/False tasks we rephrased the statement so the correct answer flipped from "True" to "False"; for numeric tasks we modified the problem so the answer changed (for example, from "10" to "15"). • Self-reported
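A sketch of the answer-permutation idea for the multiple-choice case (an illustrative helper, not the actual test harness):

```python
def move_correct_answer(choices, correct_index, new_index):
    """Return a reordered choice list with the correct answer at new_index.

    A model that truly solves the problem scores the same on both versions;
    a model exploiting position (e.g. "always pick A") does not.
    """
    reordered = list(choices)
    correct = reordered.pop(correct_index)
    reordered.insert(new_index, correct)
    return reordered

# Correct answer "10" starts at position 0 (option A) ...
original = ["10", "15", "20", "25"]
# ... and is moved to position 2 (option C) in the permuted version.
print(move_correct_answer(original, correct_index=0, new_index=2))
# → ['15', '20', '10', '25']
```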
HiddenMath
Accuracy
AI, such as ChatGPT, generally makes two kinds of mistakes that a human doesn't. One is hallucinations, which we can discuss separately, but inaccuracy is also important.
By inaccuracy I mean that the response is correctly on the requested topic, but some specific claims in it are not accurate.
For instance, if asked about the US president elected in 1976, the model might respond that the 1976 US presidential election was won by Jimmy Carter, defeating Gerald Ford, that Carter was inaugurated on January 20, 1977, and that he was followed by Ronald Reagan, who won the 1980 election. This is all accurate.
But in a different case it might claim that the 1976 US presidential election was won by Jimmy Carter, defeating Gerald Ford, that Carter was inaugurated on January 20, 1977, and that he served one term before losing to Reagan in 1980, with Ford's term as president given as "1972-1976". All but the last bit is accurate; Ford became president in 1974, not 1972. • Self-reported
MMLU-Pro
0-shot CoT — this method encourages the LLM to articulate its chain of thought while solving a task, but provides no worked example. The model reasons about the task on its own rather than imitating reasoning demonstrated in examples. In 0-shot CoT, the phrase "Let's think step by step" is typically appended after the task, prompting the model to break the solution into sequential stages. Research has shown that simply adding "Let's think step by step" before the answer can significantly improve LLM performance on tasks requiring reasoning, because the model works through the solution process instead of answering immediately. • Self-reported
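The prompt construction amounts to a one-line transform (a trivial sketch; the wrapper function is ours):

```python
def zero_shot_cot(task: str) -> str:
    """Zero-shot chain-of-thought: append the trigger phrase, no examples."""
    return f"{task}\n\nLet's think step by step."

print(zero_shot_cot("Solve the equation: 2x + 5 = 13."))
```

The resulting prompt is sent to the model as-is; the trigger phrase alone, with no demonstrations, is what distinguishes 0-shot CoT from few-shot CoT.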
MRCR
Accuracy — AI: 2 / 2 (100%). This score also reflects how confidently we can interpret the model's behavior. For example, a model may generate answers using tools, but we may misinterpret how it did so; a model may answer correctly, yet we may not notice that it is using a template to form its answers. Unless we can analyze the model's output in depth (for example, its visible thinking or logical reasoning steps), more information is required. • Self-reported
Natural2Code
Accuracy
AI: 8 • Self-reported
PhysicsFinals
0-shot — in the 0-shot case the model answers the question directly, without special instructions, examples, or other additional information. This is a valuable evaluation, since it reflects how the model will behave in real use. It gives a picture of the model's "baseline knowledge" and of how it applies that knowledge to new tasks; 0-shot is important for measuring model performance without additional help, showing its ability to transfer knowledge to novel problems. • Self-reported
Vibe-Eval
Accuracy
AI: ChatGPT + Advanced Data Analysis draws on recalled knowledge: for example, it retrieves the standard formulas for sine, cosine, and other trigonometric functions, and the Pythagorean identity.
The AI also sets up the given integral correctly and manipulates it using algebraic techniques. It applies substitution correctly, setting u = tan(x), du = sec²(x) dx, and adjusts the limits of integration accordingly.
The AI applies mathematical reasoning to derive the formula for sec²(x). It relates sec²(x) to tan²(x) using the Pythagorean identity and uses this connection to set up the substitution.
The AI also computes the result of the definite integral correctly. It handles the evaluation of the antiderivative at the integration bounds appropriately.
Overall, the AI demonstrates strong mathematical knowledge and appropriate application of calculus techniques for this problem. • Self-reported
Video-MME
Accuracy — AI: 1 / 1 (1.0) • Self-reported
WMT23
Score
Evaluation • Self-reported
XSTest
Safety Compliance — AI: models can have limitations that prevent them from answering certain types of queries, often enforced through guardrails built into the system that block or modify responses. Testing should examine which queries the model refuses, the explanations it gives for why a query cannot be answered, and how consistently these limitations are applied. Note that a model's behavior can vary with context and query phrasing, and some models are more restrictive than others, reflecting trade-offs between safety and helpfulness. • Self-reported
License & Metadata
License
proprietary
Announcement Date
May 1, 2024
Last Updated
July 19, 2025
Similar Models
Gemini 2.0 Flash Thinking
MM
Best score:0.7 (GPQA)
Released:Jan 2025
Gemini 2.5 Flash
MM
Best score:0.8 (GPQA)
Released:May 2025
Price:$0.30/1M tokens
Gemini 2.5 Pro
MM
Best score:0.8 (GPQA)
Released:May 2025
Price:$1.25/1M tokens
Gemini 2.5 Flash-Lite
MM
Best score:0.6 (GPQA)
Released:Jun 2025
Price:$0.10/1M tokens
Gemini 1.5 Flash
MM
Best score:0.8 (MMLU)
Released:May 2024
Price:$0.15/1M tokens
Gemini 2.0 Flash
MM
Best score:0.6 (GPQA)
Released:Dec 2024
Price:$0.10/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score:0.5 (GPQA)
Released:Feb 2025
Price:$0.07/1M tokens
Gemini 3 Pro
MM
Best score:0.9 (GPQA)
Released:Nov 2025
Price:$2.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.