Key Specifications
Parameters
-
Context
2.1M
Release Date
May 1, 2024
Average Score
72.6%
Timeline
Key dates in the model's history
Announcement
May 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
November 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$2.50
Output (per 1M tokens)
$10.00
Max Input Tokens
2.1M
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot — the model receives 10 worked examples in the prompt before the actual task. Providing 10 input-output pairs ahead of the question helps the model infer the expected answer format and approach. This is especially effective for complex tasks, since it lets the model pick up recurring patterns in the answers, the required format, the expected level of detail, and task-specific conventions. Compared with few-shot settings using fewer examples (e.g. 1-shot or 5-shot), 10-shot usually yields better performance, at the cost of a longer prompt. When using this method it is important to choose diverse, representative examples that cover different aspects of the task. • Self-reported
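As a sketch of how such a k-shot prompt can be assembled (the helper and the Q/A format here are illustrative, not the benchmark's actual harness):

```python
def build_few_shot_prompt(examples, question, k=10):
    """Assemble a k-shot prompt: k worked examples, then the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    parts.append(f"Q: {question}\nA:")  # the model completes this final answer
    return "\n\n".join(parts)

# Toy demonstration with 10 arithmetic examples.
shots = [(f"{i} + {i}", str(2 * i)) for i in range(10)]
prompt = build_few_shot_prompt(shots, "7 + 5", k=10)
print(prompt.count("Q:"))  # 11 blocks: 10 examples plus the target question
```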
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
The "0-shot" method refers to a model's ability to perform a task without any examples or task-specific training. The model relies exclusively on knowledge acquired during pre-training in order to answer. In 0-shot testing, the model is given the task with no additional instructions, hints, or examples of solutions to similar problems; it must generate the answer directly, using only the information in the question and its own background knowledge. For example, a 0-shot query would simply look like: "Solve the equation: 2x + 5 = 13." The model must provide the solution without any further prompting. 0-shot evaluation is the strictest test of a model's abilities, since it gives the model no hints or help beyond the question itself. • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
11-shot — the prompt contains 11 worked examples before the target problem. The description also alludes to iterative refinement: the model revisits the task over several passes, reconsidering the problem from a different angle or correcting its own errors, and improves its result each time. This is a form of chain-of-thought prompting in which the model generates intermediate reasoning steps that are then used to reach or verify the final answer; it is especially useful for complex reasoning, mathematical tasks, and other problems that require a multi-step solution process. • Self-reported
MATH
Accuracy
AI's accuracy in providing correct answers to queries is central to its utility and trustworthiness. This can be assessed by evaluating responses against ground truth answers across diverse question types.
Benchmarks: Performance on standardized tests (e.g., MMLU, GPQA, FrontierMath, Competition Math) provides quantitative accuracy metrics.
Human evaluation: Human experts can verify factual correctness, especially for nuanced questions where automated evaluation is challenging.
Consistency: Evaluating whether the AI provides the same answer to the same question across multiple attempts reveals the reliability of its reasoning.
Error analysis: Categorizing error types (e.g., factual errors, reasoning failures, hallucinations) helps identify specific weaknesses.
Domain-specific testing: Assessing performance in specialized knowledge domains (e.g., medicine, law, science) reveals the breadth and limitations of the AI's knowledge. • Self-reported
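The accuracy and consistency checks described above can be sketched as follows (a minimal illustration; the function names are ours, not part of any benchmark harness):

```python
from collections import Counter

def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(ground_truth)
    return sum(p == g for p, g in zip(predictions, ground_truth)) / len(ground_truth)

def consistency(attempts):
    """Share of repeated attempts agreeing with the most frequent answer."""
    _, count = Counter(attempts).most_common(1)[0]
    return count / len(attempts)

print(accuracy(["4", "9", "16"], ["4", "9", "15"]))  # 2 of 3 answers correct
print(consistency(["42", "42", "41", "42"]))         # 3 of 4 attempts agree
```

Real harnesses additionally normalize answers (case, whitespace, equivalent numeric forms) before comparison, which exact-match skips here.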
MGSM
8-shot • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
3-shot • Self-reported
DROP
Few-shot scaling — to probe the model's in-context (few-shot) learning ability, performance is measured while varying the number of examples in the prompt, typically from zero up to several.
Methodology: 1. Prompting: the same tasks are presented with a varying number of in-context examples (from 0 to n). 2. Scoring: model accuracy is recorded for each example count. 3. Analysis: we examine how performance changes as the number of examples grows.
This reveals how quickly the model learns from additional examples, whether extra examples significantly improve performance, and how well it performs with no examples at all (zero-shot). Comparing fixed versus randomly selected examples also shows how the choice of specific examples, and their placement in different parts of the context, affects performance. • Self-reported
GPQA
Accuracy
AI • Self-reported
Multimodal
Working with images and visual data
MathVista
Accuracy
AI models make factual errors. We measured factual accuracy using tasks on scientific, medical, and mathematical knowledge.
For GPQA, MMLU, Hellaswag, Winogrande, and general factual knowledge, we observed better accuracy with larger models, but both Claude 3 Opus and Llama 3 fell significantly behind GPT-4's accuracy levels.
In scientific knowledge, we see significant errors across all models, with Llama 3 and Claude 3 Opus providing similarly accurate responses, while GPT-4 showed the highest accuracy.
For medical knowledge, Claude 3 Opus demonstrated strong capabilities, with accuracy approaching GPT-4 in many cases, while Llama 3 demonstrated weaker performance, especially on more complex medical reasoning tasks.
In mathematical tasks, we noticed all models struggle with complex calculations and proofs, with common errors including:
- Computational mistakes
- Incorrect application of formulas
- Failure to correctly set up equations
- Making logical errors in proofs
Overall, larger models generally demonstrate better factual accuracy, but all models continue to make significant factual errors, especially in specialized domains requiring precise knowledge. • Self-reported
MMMU
Accuracy — AI: the model sometimes errs in computations, even when executing simple steps, and may fail to apply the correct method to solve a problem. This leads to incorrect answers, especially on complex mathematical or logical tasks requiring multi-step computation. Human: people can also make errors in complex calculations, but they usually handle basic mathematics reliably, know when they need to verify their work, and typically recognize task types and the corresponding solution methods. • Self-reported
Other Tests
Specialized benchmarks
AMC_2022_23
4-shot • Self-reported
FLEURS
errors in AI: We're measuring word error rate (WER), which is the percentage of words in the output that don't match the expected result. This helps us understand how accurately the model follows formatting or exact word choices in tasks requiring precision. Specifically, we compute the minimum number of edits (insertions, deletions, or substitutions) needed to transform the model's output into the reference text, divided by the number of words in the reference. For example, if the reference is "The quick brown fox jumps over the lazy dog" and the model outputs "A quick brown fox jumped over a lazy dog", the WER would be 3/9 ≈ 33.3%, since three words differ • Self-reported
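The WER computation described above can be sketched with a word-level edit distance (a minimal implementation, not the evaluation harness actually used):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits to turn hyp[:j] into ref[:i].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate(
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox jumped over a lazy dog",
)
print(round(wer, 3))  # → 0.333 (3 substitutions over 9 reference words)
```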
FunctionalMATH
Models can exploit superficial patterns in specific question types, simply guessing an answer position or recalling answers seen in training data rather than actually solving the problem; such a model can appear more capable than it is. We designed tests to detect whether a model relies on these shortcuts. Each test consists of a version where a simple heuristic would give the correct answer (for example, always choosing the first option in multiple choice) and a matched version where it would not: one variant with the correct answer in one position (e.g. answer A) and another with the correct answer moved (e.g. answer C). If the model uses a heuristic such as "always pick the first option" or "always answer True", its performance will be high on one version but drop significantly on the other. We ran these tests on a variety of mathematical tasks, including multiple-choice, True/False, and numeric-answer questions. For example, if the correct multiple-choice answer was "A", we reordered the options so the correct answer became "C"; for True/False tasks we rephrased the statement so the correct answer flipped from "True" to "False"; for numeric tasks we modified the problem so the answer changed (for example, from "10" to "15"). • Self-reported
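A sketch of the answer-permutation idea for the multiple-choice case (an illustrative helper, not the actual test harness):

```python
def move_correct_answer(choices, correct_index, new_index):
    """Return a reordered choice list with the correct answer at new_index.

    A model that truly solves the problem scores the same on both versions;
    a model exploiting position (e.g. "always pick A") does not.
    """
    reordered = list(choices)
    correct = reordered.pop(correct_index)
    reordered.insert(new_index, correct)
    return reordered

# Correct answer "10" starts at position 0 (option A) ...
original = ["10", "15", "20", "25"]
# ... and is moved to position 2 (option C) in the permuted version.
print(move_correct_answer(original, correct_index=0, new_index=2))
# → ['15', '20', '10', '25']
```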
HiddenMath
Accuracy
AI, such as ChatGPT, generally makes two kinds of mistakes that a human doesn't. One is hallucinations, which we can discuss separately, but inaccuracy is also important.
By inaccuracy I mean that the response is correctly on the requested topic, but some specific claims in it are not accurate.
For instance, if asked about the US president elected in 1976, the model might respond that the 1976 US presidential election was won by Jimmy Carter, defeating Gerald Ford, that Carter was inaugurated on January 20, 1977, and that he was followed by Ronald Reagan, who won the 1980 election. This is all accurate.
But in a different case it might claim that the 1976 US presidential election was won by Jimmy Carter, defeating Gerald Ford, that Carter was inaugurated on January 20, 1977, and that he served one term before losing to Reagan in 1980, with Ford's term as president given as "1972-1976". All but the last bit is accurate; Ford became president in 1974, not 1972. • Self-reported
MMLU-Pro
0-shot CoT — this method encourages the LLM to articulate its chain of thought while solving a task, but provides no worked example. The model reasons about the task on its own rather than imitating reasoning demonstrated in examples. In 0-shot CoT, the phrase "Let's think step by step" is typically appended after the task, prompting the model to break the solution into sequential stages. Research has shown that simply adding "Let's think step by step" before the answer can significantly improve LLM performance on tasks requiring reasoning, because the model works through the solution process instead of answering immediately. • Self-reported
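The prompt construction amounts to a one-line transform (a trivial sketch; the wrapper function is ours):

```python
def zero_shot_cot(task: str) -> str:
    """Zero-shot chain-of-thought: append the trigger phrase, no examples."""
    return f"{task}\n\nLet's think step by step."

print(zero_shot_cot("Solve the equation: 2x + 5 = 13."))
```

The resulting prompt is sent to the model as-is; the trigger phrase alone, with no demonstrations, is what distinguishes 0-shot CoT from few-shot CoT.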
MRCR
Accuracy — AI: 2 / 2 (100%). This score also reflects how confidently we can interpret the model's behavior. For example, a model may generate answers using tools, but we may misinterpret how it did so; a model may answer correctly, yet we may not notice that it is using a template to form its answers. Unless we can analyze the model's output in depth (for example, its visible thinking or logical reasoning steps), more information is required. • Self-reported
Natural2Code
Accuracy
AI: 8 • Self-reported
PhysicsFinals
0-shot — in the 0-shot case the model answers the question directly, without special instructions, examples, or other additional information. This is a valuable evaluation, since it reflects how the model will behave in real use. It gives a picture of the model's "baseline knowledge" and of how it applies that knowledge to new tasks; 0-shot is important for measuring model performance without additional help, showing its ability to transfer knowledge to novel problems. • Self-reported
Vibe-Eval
Accuracy
AI: ChatGPT + Advanced Data Analysis draws on recalled knowledge: for example, it retrieves the standard formulas for sine, cosine, and other trigonometric functions, and the Pythagorean identity.
The AI also sets up the given integral correctly and manipulates it using algebraic techniques. It applies substitution correctly, setting u = tan(x), du = sec²(x) dx, and adjusts the limits of integration accordingly.
The AI applies mathematical reasoning to derive the formula for sec²(x). It relates sec²(x) to tan²(x) using the Pythagorean identity and uses this connection to set up the substitution.
The AI also computes the result of the definite integral correctly. It handles the evaluation of the antiderivative at the integration bounds appropriately.
Overall, the AI demonstrates strong mathematical knowledge and appropriate application of calculus techniques for this problem. • Self-reported
Video-MME
Accuracy — AI: 1 / 1 (1.0) • Self-reported
WMT23
Score
Evaluation • Self-reported
XSTest
Safety Compliance — AI: models can have limitations that prevent them from answering certain types of queries, often enforced through guardrails built into the system that block or modify responses. Testing should examine which queries the model refuses, the explanations it gives for why a query cannot be answered, and how consistently these limitations are applied. Note that a model's behavior can vary with context and query phrasing, and some models are more restrictive than others, reflecting trade-offs between safety and helpfulness. • Self-reported
License & Metadata
License
proprietary
Announcement Date
May 1, 2024
Last Updated
July 19, 2025
Similar Models
Gemini 2.0 Flash Thinking
MM
Best score:0.7 (GPQA)
Released:Jan 2025
Gemini 2.5 Flash
MM
Best score:0.8 (GPQA)
Released:May 2025
Price:$0.30/1M tokens
Gemini 2.5 Pro
MM
Best score:0.8 (GPQA)
Released:May 2025
Price:$1.25/1M tokens
Gemini 2.5 Flash-Lite
MM
Best score:0.6 (GPQA)
Released:Jun 2025
Price:$0.10/1M tokens
Gemini 1.5 Flash
MM
Best score:0.8 (MMLU)
Released:May 2024
Price:$0.15/1M tokens
Gemini 2.0 Flash
MM
Best score:0.6 (GPQA)
Released:Dec 2024
Price:$0.10/1M tokens
Gemini 2.0 Flash-Lite
MM
Best score:0.5 (GPQA)
Released:Feb 2025
Price:$0.07/1M tokens
Gemini 3 Pro
MM
Best score:0.9 (GPQA)
Released:Nov 2025
Price:$2.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.