Key Specifications
Parameters
-
Context
1.0M
Release Date
May 20, 2025
Average Score
62.5%
Timeline
Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
January 31, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.30
Output (per 1M tokens)
$2.50
Max Input Tokens
1.0M
Max Output Tokens
65.5K
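For concreteness, here is a minimal sketch of how the rates above translate into per-request cost. The token counts in the example are hypothetical, and the helper function is an illustration, not part of any API; only the prices are taken from the table above.

```python
# Minimal sketch: estimating request cost from the per-token prices above.
# The token counts below are hypothetical examples, not measured values.

INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens (from the pricing table)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 100K-token prompt with a 2K-token response.
print(f"${estimate_cost(100_000, 2_000):.4f}")  # -> $0.0350
```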
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
SWE-Bench Verified
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Pass@1 In the context of language models (LLMs), Pass@1 is a metric for evaluating a model's performance at solving tasks. Pass@1 measures the proportion of tasks the model solves correctly on its first attempt, when it generates only one solution: the model is given a task, generates a single solution, and that solution is evaluated (for programming tasks, against all test cases). Pass@1 = (number of tasks solved on the first attempt) / (total number of tasks). The metric matters because it evaluates the model's ability to produce a working solution without needing several attempts or iterations; a high Pass@1 score indicates the model understands the task and can produce a correct solution on the first try. Unlike Pass@k (where k > 1), which allows the model to generate several solutions and counts success if at least one of them works, Pass@1 is the stricter metric: correctness on the first attempt. • Self-reported
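As a toy sketch of the first-attempt scoring described above: one candidate solution per task, scored 1 if it passes every check and 0 otherwise. The task format (callable solutions with input/expected test pairs) is an assumption for illustration, not any specific harness.

```python
# Illustrative sketch of Pass@1: one candidate solution per task, scored 1 if it
# passes every test case, 0 otherwise. The task format here is a toy assumption.

from typing import Callable

def passes_all_tests(solution: Callable, tests: list[tuple]) -> bool:
    """tests: (args, expected) pairs run against the candidate solution."""
    return all(solution(*args) == expected for args, expected in tests)

def pass_at_1(tasks: list[dict]) -> float:
    """Pass@1 = (tasks solved on the first attempt) / (total tasks)."""
    solved = sum(passes_all_tests(t["solution"], t["tests"]) for t in tasks)
    return solved / len(tasks)

# Toy example: two tasks, the first solved, the second not.
tasks = [
    {"solution": lambda a, b: a + b, "tests": [((1, 2), 3), ((0, 0), 0)]},
    {"solution": lambda a, b: a - b, "tests": [((1, 2), 3)]},
]
print(pass_at_1(tasks))  # -> 0.5
```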
Multimodal
Working with images and visual data
MMMU
Pass@1 In the context of evaluating AI models on complex tasks, especially those involving perception and reasoning, Pass@1 is a metric that estimates the probability that the model solves a problem correctly on its first attempt. Under Pass@1 the model provides one solution for each task; these solutions are then checked automatically to determine whether they solve the task correctly. The fraction of tasks solved correctly on the first attempt gives the Pass@1 value. The metric is particularly useful for evaluating a model's ability to solve problems without techniques such as repeated sampling: it measures the model's base ability to generate correct solutions without needing several attempts. A high Pass@1 score indicates the model understands the domain and can consistently generate correct solutions on the first try, which is an important indicator of its overall effectiveness. • Self-reported
Other Tests
Specialized benchmarks
Aider-Polyglot
Standard evaluation • Self-reported
Aider-Polyglot Edit
Diff-Fenced edit format • Self-reported
AIME 2024
Pass@1 The Pass@1 metric measures the percentage of tasks the model solves on its first attempt. It is a baseline measure of performance with no opportunity to correct errors. Under Pass@1 evaluation the model gets one attempt at each task and either solves it correctly (1 point) or does not (0 points); the points are then averaged over all tasks to obtain the overall Pass@1. Note that Pass@1 gives no credit for being close to the correct answer: it is a strict, binary metric (correct/incorrect). It also does not account for the reasoning or intermediate steps taken to reach the answer. And although Pass@1 is a useful baseline, it does not reflect how well a model can improve its answers over several attempts, as metrics such as Pass@k do. • Self-reported
AIME 2025
Pass@1 The Pass@1 metric evaluates the proportion of problems the model solves on its first attempt. It is the strictest such metric, since it gives the model no opportunity to correct errors or improve its answer over several attempts. To compute Pass@1 we use a sampling-based estimate of the probability that a single sample drawn from the generated set is correct. This method accounts for both the correct and incorrect answers the model gives on a task: if the model generates n samples for a task and k of them are correct, Pass@1 is calculated as Pass@1 = k / n. The value of Pass@1 lies in its simplicity and directness: it directly measures the model's ability to give a correct answer on the first attempt, which matters for many practical applications. • Self-reported
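A minimal sketch of the k/n estimate above, alongside the standard unbiased generalization to Pass@k known from the code-generation evaluation literature; the sample counts in the example are invented.

```python
# Sketch of the Pass@1 estimate described above: k correct out of n samples.
# The generalization to Pass@k is the standard unbiased estimator
# 1 - C(n-c, k) / C(n, k); for k = 1 it reduces to c / n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated, c: samples that are correct, k: attempts allowed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 8 samples per problem, 2 of them correct.
print(pass_at_k(n=8, c=2, k=1))  # -> 0.25 (equals c/n)
```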
FACTS Grounding
Accuracy • Self-reported
Global-MMLU-Lite
Accuracy In mathematics and problem solving, accuracy is of central importance. It measures a model's ability to give correct answers to the questions put to it. Accuracy can be evaluated in various ways: simple verification of the final answer, evaluation of the intermediate steps in solving a task, or checking whether the model's reasoning can lead to the correct result. A model should not only provide correct answers but do so with a minimal number of errors in its computations and reasoning. On complex mathematical tasks a model must maintain accuracy across many reasoning steps, apply concepts correctly, and avoid computational errors. To this end, machine-learning models are evaluated on mathematical benchmarks of varying complexity, from basic problems up to tasks at the level of AIME and IMO. These accuracy evaluations help us understand how effectively a model can cope with such tasks and where improvements are needed. • Self-reported
Humanity's Last Exam
Accuracy • Self-reported
LiveCodeBench v5
Pass@1 Pass@1 is a metric that measures what percentage of tasks a model can solve on its first attempt. Unlike a solve rate, which estimates the probability of solving a task after several attempts, Pass@1 measures the model's ability to generate a correct answer the first time. The metric is especially important in scenarios where users expect fast and exact results: it reflects the model's ability to solve tasks without repeated attempts, making it a measure of the model's reliability and accuracy. In the context of mathematical and programming tasks, Pass@1 is a strict metric, since it requires the answer to be fully correct on the first attempt. This makes Pass@1 particularly useful for evaluating models in fields where errors are costly. A lower Pass@1 score means the model more often requires additional iterations to reach a correct answer, which costs time and compute. • Self-reported
MRCR
1M-pointwise A method of comparing models in which the same query, drawn from a set of diverse assignments and queries, is shown to two different models (for example, Claude 3 Opus and GPT-4), and human experts are asked to judge which answer is better. This gives a direct picture of each model's strengths and weaknesses. Advantages of this method: it measures practical capability and gives a representative picture of performance; it tests the models' ability to perform very diverse tasks that can be hard to capture with fixed benchmarks; and it includes real user queries, which makes it more relevant to actual use. Disadvantages: results can be hard to interpret because of noisy data; evaluating many different queries requires substantial resources; and differences between models can be minor, making conclusions about their relative quality uncertain. • Self-reported
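As an illustrative sketch of the pairwise protocol just described, the snippet below tallies expert judgments into per-model win rates. The judgment format and model names are assumptions made for the example, not any real evaluation data.

```python
# Sketch of tallying pairwise expert judgments into win rates.
# Each judgment records which of two models' answers the expert preferred;
# the data below is invented for illustration.
from collections import Counter

judgments = [
    {"pair": ("model_a", "model_b"), "winner": "model_a"},
    {"pair": ("model_a", "model_b"), "winner": "model_b"},
    {"pair": ("model_a", "model_b"), "winner": "model_a"},
]

wins = Counter(j["winner"] for j in judgments)
comparisons = Counter(m for j in judgments for m in j["pair"])

for model in comparisons:
    print(f"{model}: {wins[model] / comparisons[model]:.0%} win rate")
# -> model_a: 67% win rate, model_b: 33% win rate
```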
SimpleQA
Accuracy • Self-reported
Vibe-Eval
Accuracy The percentage or proportion of correct answers a model gives on a set of test tasks. It is the most common measure of effectiveness for many tasks, especially those with specific correct answers.
Accuracy = (number of correct answers) / (total number of questions)
Advantages:
- Easy to understand and interpret
- Well suited to tasks with clearly correct answers
- Allows different models or approaches to be compared directly
Limitations:
- Does not account for the varying difficulty of questions
- Does not reflect the model's confidence in its answers
- Can be misleading for imbalanced data
- Gives no picture of the kinds of errors made
• Self-reported
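A minimal sketch of the accuracy formula above; the predictions and reference answers are invented toy data.

```python
# Sketch of the accuracy formula above: correct answers / total questions.
# The toy predictions and reference answers below are invented.

def accuracy(predictions: list, references: list) -> float:
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # -> 0.75
```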
License & Metadata
License
proprietary
Announcement Date
May 20, 2025
Last Updated
July 19, 2025
Similar Models
Gemini 3 Flash
MM
Best score: 0.9 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
Gemini 1.5 Pro
MM
Best score: 0.9 (MMLU)
Released: May 2024
Price: $2.50/1M tokens
Gemini 2.5 Pro
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $1.25/1M tokens
Gemini 2.5 Pro Preview 06-05
MM
Best score: 0.9 (GPQA)
Released: Jun 2025
Price: $1.25/1M tokens
Gemini 2.5 Flash-Lite
MM
Best score: 0.6 (GPQA)
Released: Jun 2025
Price: $0.10/1M tokens
Gemini 3 Pro
MM
Best score: 0.9 (GPQA)
Released: Nov 2025
Price: $2.00/1M tokens
Gemini 3.1 Pro
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $2.50/1M tokens
Gemini 2.0 Flash Thinking
MM
Best score: 0.7 (GPQA)
Released: Jan 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.