Key Specifications
Parameters
-
Context
200.0K
Release Date
March 13, 2024
Average Score
71.5%
Timeline
Key dates in the model's history
Announcement
March 13, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.25
Output (per 1M tokens)
$1.25
Max Input Tokens
200.0K
Max Output Tokens
200.0K
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot • Self-reported
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
0-shot "0-shot" evaluation refers to testing a model without providing examples of how to perform the task. The model receives only an instruction or query and must generate an answer without in-context training examples. This method measures the model's ability to perform the task using only knowledge acquired during pretraining, with no additional context or task-specific demonstrations. 0-shot testing is especially important for measuring a model's general capabilities and its ability to follow instructions without extra prompting. It is the strictest form of evaluation, since it requires the model to handle a new task with no examples at all. • Self-reported
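A 0-shot prompt can be sketched in a few lines. The function name and wording below are illustrative, not taken from any particular evaluation harness:

```python
# Minimal sketch of a 0-shot prompt: instruction plus query,
# with no worked examples. Names and wording are illustrative.
def build_zero_shot_prompt(instruction: str, question: str) -> str:
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

prompt = build_zero_shot_prompt(
    "Answer the following question concisely.",
    "What is the capital of France?",
)
print(prompt)
```

The key property is what the prompt does not contain: no demonstrations, only the instruction and the query itself.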
Mathematics
Mathematical problems and computations
GSM8k
0-shot CoT "0-shot Chain of Thought" (0-shot CoT) is an approach in which the model is asked to reason "step by step" while solving a task, without being given examples of such reasoning. The simplest form of 0-shot CoT appends the phrase "let's think step by step" to the query. This encourages the model to generate a chain of logical reasoning before giving its final answer. Unlike prompts where the model may answer immediately, 0-shot CoT pushes it to break a complex problem into more manageable parts, which often leads to more accurate results, especially on hard tasks such as mathematical computation or logical puzzles. The advantage of 0-shot CoT is that it requires no task examples with worked reasoning, which makes the method simpler than few-shot CoT. • Self-reported
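The trigger phrase can be appended mechanically. A minimal sketch, assuming the common "let's think step by step" cue (the exact wording varies between papers):

```python
# Minimal sketch of a 0-shot CoT prompt: a step-by-step cue is
# appended to the question, with no worked reasoning examples.
def build_zero_shot_cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

cot_prompt = build_zero_shot_cot_prompt(
    "A farmer has 3 pens with 7 sheep each. How many sheep in total?"
)
print(cot_prompt)
```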
MATH
0-shot CoT Chain-of-thought prompting without demonstrations: the model is asked to reason step by step without being shown examples of such reasoning. Even without demonstrations, prompting the model to think "step by step" before giving its answer often improves performance significantly. The approach is especially useful when no examples are available, or when tasks are too varied for a few demonstrations to cover them. Asking for step-by-step reasoning encourages the model to decompose the problem and structure its answer, which often leads to more accurate results. As with other reasoning methods, 0-shot CoT helps substantially more for larger models, since they follow instructions better and can generate more complex chains of reasoning. • Self-reported
MGSM
When a model has all the information it needs for a task but still produces an incorrect answer, we call this an inference error. To evaluate a model's ability to derive correct answers from the information available to it, we use tasks that require logical reasoning over that information; examples include word puzzles and multi-step problems that can be solved by reasoning alone. For instance, we can give the model a prompt containing every fact needed for the answer: if the model still answers incorrectly, this indicates that it failed to carry out the inference correctly. This kind of error differs from a knowledge error, where the model would have answered correctly had it possessed the missing information: in an inference error the model already has the information, but something goes wrong in its reasoning or answer-generation process. • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
3-shot CoT Chain-of-Thought (CoT) reasoning with three examples is a method for improving an LLM's reasoning by showing worked, step-by-step solutions. It is standard few-shot prompting with one important difference: each example not only shows the final answer but also demonstrates the intermediate reasoning steps. In 3-shot CoT the model is given three worked examples, each containing: 1. the task/question; 2. the step-by-step reasoning (chain of thought); 3. the final answer. The method is especially effective for mathematical tasks, logical puzzles, and other problems that require multi-step reasoning. Three examples usually provide enough context for the model to pick up the reasoning pattern without making the prompt excessively long. Research shows that models prompted with CoT often solve complex tasks markedly better than with direct prompting, since they break problems into manageable steps and reason sequentially. • Self-reported
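The three-part structure described above (question, reasoning, answer) can be sketched as a prompt builder. The worked examples here are made up for illustration:

```python
# Hypothetical worked examples: (question, reasoning, answer).
EXAMPLES = [
    ("What is 2 + 3?", "2 plus 3 is 5.", "5"),
    ("If I have 10 apples and eat 4, how many remain?",
     "10 minus 4 is 6.", "6"),
    ("What is 3 * 4?", "3 times 4 is 12.", "12"),
]

def build_few_shot_cot_prompt(examples, question):
    # Each demonstration pairs the question with its chain of
    # thought and final answer; the new question comes last,
    # leaving the model to produce both reasoning and answer.
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt_3shot = build_few_shot_cot_prompt(EXAMPLES, "What is 6 + 7?")
```

Because each demonstration ends with a "The answer is ..." line, the model tends to imitate both the reasoning style and the answer format.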
DROP
3-shot, F1 score The F1 metric evaluates model performance on tasks that require several steps of reasoning. In the 3-shot setting the model is given three worked examples before it attempts a new task. F1 is the harmonic mean of precision and recall, and is especially useful when both false positives and false negatives matter. In this context the F1 score measures how well the model's answer overlaps with the reference answer, independently of the number of examples provided. • Self-reported
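Token-level F1 of the kind used for DROP-style answers can be sketched as the harmonic mean of precision and recall over overlapping answer tokens. This is a simplified sketch (whitespace tokenization and lowercasing only), without the answer normalization a real harness applies:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of precision and recall over overlapping tokens.

    Simplified sketch: whitespace tokenization, lowercasing only.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0  # also covers empty prediction or reference
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the answer is 42", "42")` gives 0.4: precision is 1/4 (one of four predicted tokens matches), recall is 1/1, and the harmonic mean of 0.25 and 1.0 is 0.4.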
GPQA
0-shot CoT The model produces intermediate reasoning steps on its way to an answer, without worked examples in the prompt: it solves the task and lays out its chain of thought within the solution. This differs from plain 0-shot in that the model does not simply give an immediate answer but also shows its working. It also differs from few-shot chain-of-thought prompting, where worked reasoning examples are included in the prompt: in 0-shot CoT a simple cue such as "let's think step by step" is enough for the model to show intermediate steps on its own. Example: for a mathematical question, the model not only gives the answer but also shows the stages of the solution, even though the prompt did not ask it to explain its working. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
25-shot The 25-shot method is a technique in which the model is given 25 examples of previous answers or worked tasks before it solves a new one. This approach is especially useful for conditioning the model on the expected format and style of answer, and it usually gives better results than methods with fewer examples, such as 0-shot (no examples) or few-shot (a handful of examples). Providing 25 fully worked examples gives the model ample context to identify the required pattern, which is not a problem for the long context windows of modern LLMs. The drawback is that the method requires a sufficiently large pool of representative examples for the task. • Self-reported
License & Metadata
License
proprietary
Announcement Date
March 13, 2024
Last Updated
July 19, 2025
Similar Models
Claude Sonnet 4
Anthropic
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $3.00/1M tokens
Claude Opus 4
Anthropic
MM
Best score: 0.8 (GPQA)
Released: May 2025
Price: $15.00/1M tokens
Claude 3.7 Sonnet
Anthropic
MM
Best score: 0.8 (GPQA)
Released: Feb 2025
Price: $3.00/1M tokens
Claude 3 Sonnet
Anthropic
MM
Best score: 0.9 (ARC)
Released: Feb 2024
Price: $3.00/1M tokens
Claude 3.5 Sonnet
Anthropic
MM
Best score: 0.9 (HumanEval)
Released: Oct 2024
Price: $3.00/1M tokens
Claude Sonnet 4.6
Anthropic
MM
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $3.00/1M tokens
Claude Opus 4.6
Anthropic
MM
Best score: 1.0 (TAU)
Released: Feb 2026
Price: $5.00/1M tokens
Claude Sonnet 4.5
Anthropic
MM
Best score: 0.9 (TAU)
Released: Sep 2025
Price: $3.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.