Key Specifications
Parameters
-
Context
200.0K
Release Date
February 29, 2024
Average Score
73.8%
Timeline
Key dates in the model's history
Announcement
February 29, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$3.00
Output (per 1M tokens)
$15.00
Max Input Tokens
200.0K
Max Output Tokens
200.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot In a 10-shot prompt, the model is first shown 10 worked examples with their correct answers, followed by the target question it must answer. The examples demonstrate how to solve the problem; they should be diverse enough to cover a range of cases and solution styles, and should not be too similar to one another or to the target question. 10-shot prompting often produces better results than 0-shot and 1-shot methods and can approach the effectiveness of fine-tuning on some tasks. The trade-off is that the examples occupy a large part of the prompt: 10-shot prompting is most useful when the task benefits from demonstrating diverse solution methods, and less so when the task is simple or can be explained with instructions alone (see the sketch below). • Self-reported
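As a rough illustration, here is a minimal sketch of how such an n-shot prompt might be assembled. The function name, prompt format, and toy data are assumptions for illustration, not the harness actually used for this benchmark.

```
# Minimal sketch: assembling an n-shot prompt from worked examples.
# Format and data are illustrative only, not the official HellaSwag harness.

def build_few_shot_prompt(examples, target_question, n_shots=10):
    """Concatenate n worked examples, then the unanswered target question."""
    parts = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples[:n_shots]
    ]
    # The model is expected to complete the final, unanswered entry.
    parts.append(f"Question: {target_question}\nAnswer:")
    return "\n\n".join(parts)

examples = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    # ...eight more diverse worked examples...
]
prompt = build_few_shot_prompt(examples, "3 + 5 = ?")
```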
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
0-shot To assess model capabilities we use zero-shot (0-shot) evaluation, meaning the model receives no example solutions before being asked to complete the task. We take this approach for several reasons: 1) it matches how most people actually use models; 2) it is the strictest test of a model's abilities, since it cannot simply copy solutions from examples; 3) it evaluates the model's capabilities without additional prompting or scaffolding; 4) it avoids leaking answers through the examples. For difficult tests such as GPQA, the zero-shot setting is especially important, since providing examples can hint at solutions or otherwise skew the evaluation. Using only the question, without examples, gives a cleaner measure of the model's underlying knowledge and reasoning (a scoring sketch follows below). • Self-reported
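For HumanEval-style tasks, scoring typically means running the model's completion against the task's unit tests. The sketch below is a hypothetical, simplified pass/fail check, not the official harness; a real harness runs the untrusted code in a sandbox rather than calling exec() directly.

```
# Hypothetical, simplified pass/fail check for a HumanEval-style task.
# WARNING: exec() on untrusted model output is unsafe; a real harness
# isolates candidates in a sandbox. Shown here for illustration only.

def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the model's completion passes the task's assertions."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the task's assert statements
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_unit_tests(candidate, tests))  # True
```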
Mathematics
Mathematical problems and computations
GSM8k
0-shot CoT A chain-of-thought method that has the model reason step by step toward a solution. Unlike few-shot CoT, the model receives no example reasoning chains; it is simply asked to reason before giving its final answer, usually via an instruction such as "Let's solve this task step by step". This instruction leads the model to break a complex task into smaller components, which improves results compared with asking for the answer directly. The effectiveness of 0-shot CoT is typically measured against direct prompting without reasoning on a range of mathematical and logical tasks; research shows that this approach significantly improves model performance, especially on complex problems (see the sketch below). • Self-reported
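A minimal sketch of how the trigger phrase is attached to a question; the exact wording of the trigger varies between papers and harnesses, so treat this as illustrative.

```
# Minimal sketch: turning a plain question into a 0-shot CoT prompt.
# The trigger wording below is one common variant, not a fixed standard.
COT_TRIGGER = "Let's solve this task step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Append the chain-of-thought trigger so the model reasons before answering."""
    return f"{question}\n{COT_TRIGGER}"

print(zero_shot_cot_prompt(
    "A farmer has 12 eggs and sells 5. How many eggs are left?"
))
```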
MATH
# 0-shot CoT
The 0-shot CoT (zero-shot chain-of-thought) method is based on the observation that a model can be encouraged to think step by step through a complex problem without being shown any examples of such step-by-step solutions.

## Approach
Add a simple prompt such as "Let's solve this step by step", so that the model breaks the solution into sequential reasoning instead of jumping straight to a final answer.

## Advantages
- **Simplicity**: requires no crafted example demonstrations.
- **Generality**: applies to a wide range of tasks and models.
- **Effectiveness**: can significantly improve model performance on complex tasks that require logical reasoning.

## Limitations
- Effectiveness depends on the model's underlying reasoning ability.
- May work less well than few-shot CoT for some specific task types.
- Reasoning quality and answer accuracy can vary with the prompt wording.

## Example
```
Task: He had 5 apples. He ate 2 apples and got 3 apples from a friend. How many apples does he have now?
Prompt with 0-shot CoT: Let's solve this step by step.
```

## Application
0-shot CoT is especially useful for:
- testing a model's reasoning abilities
- situations where there is no time or budget to craft examples
- improving performance across diverse tasks
• Self-reported
MGSM
0-shot A simple and direct method for measuring LLM performance: the task itself is given as the prompt, with no additional instructions or examples. This approach is attractive because it does not depend on the model's ability to adapt to a particular prompt format or example solutions, which makes it well suited for comparisons between models, especially when a model does not have enough context for more elaborate approaches. • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
3-shot CoT In this mode, the prompt shows the model three worked examples before asking it to answer the target question; each example includes the full solution and its justification, and the model is then expected to apply the same kind of reasoning to the new problem. The 3-shot CoT format demonstrates how to break a complex task into more manageable steps, and with several examples the model can identify the relevant patterns and solution strategies. This approach is useful for measuring performance on tasks that require step-by-step thinking, such as mathematical problems, logic puzzles, and other tasks where producing the answer directly is unreliable (see the sketch below). • Self-reported
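A minimal sketch of how a 3-shot CoT prompt might be laid out; the field names, format, and toy example are assumptions, not the actual BIG-Bench Hard prompts.

```
# Minimal sketch: a 3-shot CoT prompt where each example carries a worked
# reasoning chain before its answer. Toy data, illustrative format only.

SHOTS = [
    {
        "question": "If all bloops are razzies and all razzies are lazzies, "
                    "are all bloops lazzies?",
        "reasoning": "All bloops are razzies, and all razzies are lazzies. "
                     "By transitivity, all bloops are lazzies.",
        "answer": "Yes",
    },
    # ...two more examples, each with a reasoning chain...
]

def three_shot_cot_prompt(target_question: str) -> str:
    parts = [
        f"Q: {s['question']}\nReasoning: {s['reasoning']}\nA: {s['answer']}"
        for s in SHOTS[:3]
    ]
    # The model continues from "Reasoning:" for the target question.
    parts.append(f"Q: {target_question}\nReasoning:")
    return "\n\n".join(parts)
```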
DROP
3-shot, F1 score The F1 score is an accuracy measure defined as the harmonic mean of precision and recall. It provides a single metric that balances the two, and is especially useful when both false positives and false negatives matter. In 3-shot F1 evaluation, the model makes its prediction after being shown 3 examples ("shots"), and the prediction is scored with F1 against the reference answer. This measures how well the model can generalize from a small number of examples, which is important for evaluating in-context learning ability (see the sketch below). • Self-reported
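For reference, here is a sketch of token-level F1 in the style used by reading-comprehension benchmarks such as DROP. It is simplified (no answer normalization or multi-span handling), so treat it as illustrative rather than the official scorer.

```
# Sketch of token-level F1 scoring, simplified: real DROP scoring also
# normalizes answers and handles numbers and multi-span references.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("born in 1876", "1876"))  # 0.5 (recall 1.0, precision 1/3)
```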
GPQA
0-shot CoT - Diamond Zero-shot chain of thought (0-shot CoT) is an approach that encourages the model to produce step-by-step reasoning by adding a prompt such as "let's think step by step" before the query. This lets the model structure complex reasoning without any example reasoning chains, which usually improves results on tasks that require several steps of thinking. "Diamond" refers to the GPQA Diamond subset, the most difficult split of the benchmark. We use 0-shot CoT for this evaluation because it effectively elicits the model's own reasoning without examples that could bias it toward specific solutions. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
25-shot • Self-reported
MMLU-Pro
0-shot CoT 0-shot Chain-of-Thought (CoT) is a method that prompts the model to think before giving its final answer, typically by means of an instruction such as "Let's think step by step" placed before the request for the answer. Unlike few-shot CoT, which provides example reasoning chains, 0-shot CoT requires no examples at all. It is one of the most widely used methods for improving LLM performance: it substantially improves models' ability to solve mathematical, logical, and multi-step reasoning tasks. The method is especially effective for modern LLMs with strong reasoning abilities, such as GPT-4. Beyond that, 0-shot CoT serves as a basis for other reasoning techniques; for example, it allows models to apply strategies for verifying their own solutions. • Self-reported
License & Metadata
License
proprietary
Announcement Date
February 29, 2024
Last Updated
July 19, 2025
Similar Models
Claude 3.7 Sonnet
Anthropic
Multimodal
Best score: 0.8 (GPQA)
Released: Feb 2025
Price: $3.00/1M tokens
Claude 3.5 Sonnet
Anthropic
Multimodal
Best score: 0.9 (HumanEval)
Released: Oct 2024
Price: $3.00/1M tokens
Claude 3 Haiku
Anthropic
Multimodal
Best score: 0.9 (ARC)
Released: Mar 2024
Price: $0.25/1M tokens
Claude Sonnet 4
Anthropic
Multimodal
Best score: 0.8 (GPQA)
Released: May 2025
Price: $3.00/1M tokens
Claude Opus 4
Anthropic
Multimodal
Best score: 0.8 (GPQA)
Released: May 2025
Price: $15.00/1M tokens
Claude Haiku 4.5
Anthropic
Multimodal
Best score: 0.8 (TAU)
Released: Oct 2025
Price: $1.00/1M tokens
Claude Sonnet 4.6
Anthropic
Multimodal
Best score: 0.9 (GPQA)
Released: Feb 2026
Price: $3.00/1M tokens
Claude 3.5 Sonnet
Anthropic
Multimodal
Best score: 0.9 (HumanEval)
Released: Jun 2024
Price: $3.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.