Key Specifications
Parameters
405.0B
Context
128.0K
Release Date
July 23, 2024
Average Score
79.2%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
405.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$3.50
Output (per 1M tokens)
$3.50
Max Input Tokens
128.0K
Max Output Tokens
128.0K
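At these rates ($3.50 per 1M tokens for both input and output), per-request cost is simple arithmetic; the token counts in the sketch below are illustrative, not from the source:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.50,
                 output_price_per_m: float = 3.50) -> float:
    """Estimate USD cost of one request at per-1M-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token completion:
# 2,500 tokens * $3.50 / 1M = $0.00875
cost = request_cost(2_000, 500)
```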
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot, macro_avg/acc • Self-reported
Programming
Programming skills tests
HumanEval
0-shot, pass@1 • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
8-shot, CoT, em_maj1@1. This setup uses chain-of-thought (CoT) prompting with majority voting over multiple samples. Specifically: 1. 8 CoT solutions are sampled with different reasoning paths. 2. Each sample reasons through the task and produces an answer. 3. A final answer is extracted from each sample. 4. The most common answer is chosen. The benefit of this approach is that it can cancel out errors that appear in individual samples: by taking the most common answer, the chance of obtaining the correct one increases. • Self-reported
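A minimal sketch of the majority-vote aggregation described above, assuming the sampling and answer-extraction steps have already happened; the sampled answers are invented for illustration:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common extracted final answer among CoT samples."""
    return Counter(answers).most_common(1)[0][0]

# final answers extracted from 8 sampled chains; the majority wins:
samples = ["42", "42", "41", "42", "40", "42", "42", "41"]
assert majority_vote(samples) == "42"
```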
MATH
0-shot, CoT, final exact match • Self-reported
Reasoning
Logical reasoning and analysis
DROP
0-shot — a method in which the model solves a new task without seeing a single example of a correct solution, relying exclusively on the knowledge and abilities acquired during pretraining. For example, given a math problem, the model must solve it using only its existing knowledge of mathematics, with no access to worked solutions of similar problems. This probes the model's base capabilities rather than its ability to adapt to a specific prompt format. 0-shot testing is considered one of the strictest ways to evaluate a model's understanding, since it requires applying general knowledge to a specific task without additional prompts or training. • Self-reported
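The 0-shot vs. few-shot distinction comes down to how the prompt is built; a hypothetical sketch (the Q/A template here is an assumption for illustration, not the benchmark's actual format):

```python
def build_prompt(question: str, examples=None) -> str:
    """0-shot: just the question; few-shot: worked examples first."""
    parts = []
    for q, a in (examples or []):   # empty in the 0-shot case
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("What is 17 * 3?")
few_shot = build_prompt("What is 17 * 3?",
                        examples=[("What is 2 * 4?", "8")])
```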
GPQA
The prompt does not specify whether or how the model should use tools. You directly ask the model to perform the task, and it answers using whatever capabilities it has. This approach is useful for gauging a model's general capabilities, though results can vary depending on which tools, if any, end up being used. • Self-reported
Other Tests
Specialized benchmarks
API-Bank
0-shot: the model's ability to perform a task without any examples, instructions, or explanations. The model receives only the question or problem. For example, when solving a math task the prompt might be just the question ("Find the number n such that n² - 20n + 96 is …"), with no reference solutions or samples provided. The model must independently devise an approach and carry out all the necessary steps. • Self-reported
ARC-C
0-shot testing: the model is given a prompt or query without any prior examples or additional context, and its output is scored. This evaluates the model's ability to apply its pretrained knowledge to a new task without adaptation. The 0-shot setting is especially useful for assessing how well a model transfers its knowledge across contexts, and because no per-task examples are needed, it can give a more realistic picture of performance in real-world conditions, where users rarely provide examples. However, 0-shot results may not fully reflect a model's potential, especially on complex or specialized tasks where examples would be useful. • Self-reported
BFCL
0-shot means the model attempts the task directly from the instructions, without additional examples. This measures the model's ability to understand a task immediately, without demonstrations of how it should be done. The contrast is few-shot, where the model is shown one or several worked examples before it performs the task. • Self-reported
Gorilla Benchmark API Bench
0-shot. No demonstrations or training examples are provided for the task; instead, the model must rely on knowledge acquired during pretraining to generate solutions. • Self-reported
IFEval
# Standard benchmarking ## What it is The simplest and fastest way to evaluate a model, used to establish baseline performance scores and to compare results across models. Testing consists of a fixed set of queries and tasks, often from a specific domain, scored against predefined criteria. ## When to use it - When you need to quickly compare the performance of several models before making a decision. - To establish baseline scores on which deeper analysis can build. - To surface obvious problems before more thorough evaluation. ## Strengths - Cheap and simple to run. - Enables direct comparison of different models. - Produces performance data with minimal setup. ## Limitations - Gives limited insight into a model's real capabilities. - Can be misleading on complex tasks. - Misses nuances that can be critical for specific use cases. ## When it falls short Standard testing is insufficient when you need a deeper understanding of a model's capabilities, especially when evaluating: - reasoning and multi-step processes - answers to varied queries - robustness to different interpretations of a query in complex tasks ## Methods - Running standard test suites such as MMLU, GSM8K, HumanEval - Comparing outputs against reference answers - Measuring accuracy on simple instruction following - Verifying baseline behavior ## Improving it Standard testing can be improved by: - adding more diverse tasks covering different domains and skills - adding harder examples that stress the model's capabilities - better tooling to process and score results - tracking results over time to identify regressions • Self-reported
MBPP EvalPlus
0-shot, base, pass@1 • Self-reported
MMLU (CoT)
0-shot, CoT, macro_avg/acc • Self-reported
MMLU-Pro
5-shot, CoT, micro_avg/acc_char • Self-reported
Multilingual MGSM (CoT)
0-shot, CoT, em. The model is evaluated in chain-of-thought (CoT) mode. In the 0-shot setting we prompt it to reason "step by step", without any examples. "em" stands for "exact match": the model's answer is scored correct only if it exactly matches the reference answer for the task. Note that in some cases a model can arrive at the correct value but still be marked wrong due to differences in formatting, so em can be a strict evaluation. • Self-reported
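A minimal sketch of exact-match scoring; the normalization shown (trim and lowercase) is an assumption for illustration, since each benchmark defines its own rules:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict em scoring after light normalization (whitespace, case)."""
    return prediction.strip().lower() == reference.strip().lower()

assert exact_match(" 72 ", "72")
assert not exact_match("72.0", "72")   # formatting differences still fail
```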
Multipl-E HumanEval
0-shot, pass@1. This metric measures how often the model solves a task on its first attempt, without examples. The model receives the task and must provide a correct answer on the first try; for problems with a verifiable answer, as in mathematics, the task counts as "passed" only if the first answer is correct. pass@1 is the proportion of tasks solved correctly on the first attempt. "0-shot" means the model is given no example solutions to similar tasks: unlike the few-shot approach, where the model sees several worked examples before answering, in the 0-shot setting the model must rely only on knowledge acquired during pretraining. pass@1 is a strict evaluation, since it allows no retries or error correction, which makes it a good measure of reliability where first-attempt accuracy is required. • Self-reported
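A minimal sketch of the pass@1 computation for the single-attempt-per-task case; the outcome list is invented for illustration:

```python
def pass_at_1(first_attempt_results: list[bool]) -> float:
    """pass@1: fraction of tasks solved on the single first attempt."""
    return sum(first_attempt_results) / len(first_attempt_results)

# first-attempt outcomes across 4 tasks -> 3/4 solved
score = pass_at_1([True, True, False, True])
```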
Multipl-E MBPP
0-shot, pass@1 • Self-reported
Nexus
0-shot, macro_avg/acc • Self-reported
License & Metadata
License
llama_3_1_community_license
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Llama 3.1 70B Instruct
Meta
70.0B
Best score: 0.9 (ARC)
Released: Jul 2024
Price: $0.89/1M tokens
Llama 4 Maverick
Meta
400.0B (multimodal)
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.27/1M tokens
Kimi K2 0905
Moonshot AI
1.0T
Best score: 0.9 (HumanEval)
Released: Sep 2025
Price: $0.60/1M tokens
Llama 3.1 Nemotron Ultra 253B v1
NVIDIA
253.0B
Best score: 0.8 (GPQA)
Released: Apr 2025
GLM-4.7
Zhipu AI
358.0B
Best score: 0.9 (TAU)
Released: Dec 2025
Price: $0.60/1M tokens
Qwen3-Coder 480B A35B Instruct
Alibaba
480.0B
Best score: 0.8 (TAU)
Released: Jan 2025
DeepSeek-V3
DeepSeek
671.0B
Best score: 0.9 (MMLU)
Released: Dec 2024
Price: $0.27/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.