Key Specifications
Parameters
70.0B
Context
128.0K
Release Date
July 23, 2024
Average Score
74.7%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
70.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.89
Output (per 1M tokens)
$0.89
Max Input Tokens
128.0K
Max Output Tokens
128.0K
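At the listed rates, the cost of a request is simply the token count divided by one million, times the per-direction price. A minimal sketch of that arithmetic, assuming the $0.89/1M input and output prices above (the helper name is illustrative):

```python
# Rough cost estimate at the listed per-1M-token rates (illustrative sketch).
INPUT_PRICE_PER_1M = 0.89   # USD per 1M input tokens
OUTPUT_PRICE_PER_1M = 0.89  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request at the listed prices."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

# Example: a 4,000-token prompt with a 1,000-token completion costs about $0.00445
print(f"${estimate_cost_usd(4_000, 1_000):.5f}")
```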
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
Models such as GPT-4 generate plausible-looking answers, but those answers can contain errors of fact or reasoning that are hard to catch without domain knowledge or verification. One approach is to present the model with an answer in advance and ask it to explain why it is incorrect: if the model identifies the mistake and gives a correct explanation, that speaks to its ability to evaluate information critically and correct errors; if it treats the incorrect answer as correct or tries to justify it, that points to gaps in its understanding or its training. This kind of probing is especially useful in fields where correct answers are known and verifiable, for example mathematics or factual questions. • Self-reported
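A minimal sketch of how such a probe might be phrased, presenting a deliberately wrong answer and asking the model to critique it; the prompt wording and helper name are illustrative assumptions, not part of the benchmark itself:

```python
def build_verification_prompt(question: str, proposed_answer: str) -> str:
    """Ask the model to judge a (possibly wrong) answer instead of producing one."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n\n"
        "Is the proposed answer correct? If it is wrong, explain the error "
        "and give the correct answer."
    )

# Probe with a deliberately incorrect answer (17 * 24 is actually 408).
print(build_verification_prompt("What is 17 * 24?", "398"))
```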
Reasoning
Logical reasoning and analysis
DROP
We evaluate the model's ability to answer a question directly, with no examples, no task-specific instructions, and no additional context, which measures its baseline capability. In this setup the model receives either the question or task alone, or the question together with a note about the expected answer format. The mode checks how well the model understands and solves tasks relying only on its pretrained knowledge, which is especially important for judging whether it can interpret a task correctly without extra prompts or examples. • Self-reported
GPQA
0-shot In AI evaluation, "0-shot" (zero-shot) refers to measuring a model's ability to perform a task without any examples or task-specific instructions. The model is judged purely on how well it applies its general training to the new task, with no demonstrations. For example, to measure an LLM's 0-shot ability we can ask it to solve a task it has not been shown how to solve, providing no sample solutions. 0-shot is often contrasted with few-shot approaches, where the model is given several examples before the task. It is especially important for gauging a model's generalization and its grasp of the domain. • Self-reported
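A minimal sketch of what the 0-shot setup described above looks like in practice, with a few-shot variant for contrast; the prompt wording and helper names are illustrative assumptions:

```python
def build_zero_shot_prompt(question: str) -> str:
    """0-shot: only the question and a format hint, no worked examples."""
    return f"Question: {question}\nAnswer:"

def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Few-shot, for contrast: prepend worked (question, answer) pairs."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

print(build_zero_shot_prompt("Which element has the chemical symbol Fe?"))
```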
Other Tests
Specialized benchmarks
API-Bank
In 0-shot testing the model receives no worked examples of the task or its results. It must rely exclusively on knowledge acquired during pretraining to form its answer. This evaluation method shows how well the model transfers its knowledge to tasks it was never explicitly shown during training. • Self-reported
ARC-C
0-shot For this evaluation the model receives the task without any examples or additional information and must complete it using only its pretrained knowledge. Unlike few-shot setups, where the model can adapt based on a handful of demonstrations, in the 0-shot setting it relies entirely on what it learned during training. This approach demonstrates the model's ability to generalize and apply its knowledge to new tasks without extra guidance; 0-shot scores are often used as a check of a model's basic capabilities and a rough proxy for general competence and generalization. • Self-reported
BFCL
Self-reported
Gorilla Benchmark API Bench
0-shot The zero-shot method means the task is given without any examples of how to solve it. The model works from the prompt instructions alone and must figure out on its own how to carry out the assignment. This is the hardest setting for the model, since it receives no additional context and no examples of similar tasks being solved; it relies exclusively on knowledge acquired during pretraining and on the query itself. The method is often used to assess a model's basic ability to understand and solve tasks without extra help. • Self-reported
GSM-8K (CoT)
8-shot Chain-of-Thought 8-shot Chain-of-Thought (CoT) prompting has the model work through reasoning drawn from several worked examples before answering the target question. The examples (usually about 8) each contain both a question and the step-by-step reasoning that leads to its answer, demonstrating how to break a complex problem into a sequence of intermediate steps. When the LLM is then presented with a new question after these examples, it reproduces the same kind of reasoning sequence before giving its answer. The method is particularly effective for tasks that require complex reasoning, such as math word problems, logic puzzles, and inference. The appeal of 8-shot CoT is that it needs no explicit instructions about how to reason: the model picks up the pattern from the examples, so step-by-step thinking can be applied without crafting specialized prompts for every task type. • Self-reported
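A minimal sketch of assembling such a CoT prompt from worked examples; the exemplars and wording are illustrative, not the actual GSM-8K harness, and only two of the eight shots are written out:

```python
# Each exemplar pairs a question with its full step-by-step solution.
EXEMPLARS = [
    ("Tom has 3 boxes of 12 pencils. How many pencils does he have?",
     "Each box holds 12 pencils and there are 3 boxes. 3 * 12 = 36. The answer is 36."),
    ("A shirt costs $20 and is discounted by 25%. What is the new price?",
     "25% of 20 is 5. 20 - 5 = 15. The answer is 15."),
    # ...in the 8-shot setting, six more worked examples would follow here...
]

def build_cot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    """Concatenate the worked reasoning examples, then append the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_cot_prompt(EXEMPLARS, "A train travels 60 km/h for 2.5 hours. How far does it go?"))
```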
IFEval
Standard AI evaluations and other benchmark tests are used to measure how well models perform various tasks and to compare them with other models. Important as they are, these evaluations have several limitations. First, they usually score only the model's final answer, not how it got there. For a problem like 97 × 98, a model such as Claude may reach the correct answer (9506) via a working like 97 × 98 = 97 × 100 - 97 × 2 = 9700 - 194 = 9506, but grading the final answer alone says nothing about whether the reasoning was sound; analyzing the intermediate steps is what reveals how and why a model makes errors. Second, most evaluations use the base model and do not let it draw on capabilities such as tools or extended thinking, and answers are typically taken as-is, with no opportunity for follow-up queries if an answer is ambiguous. Third, standard evaluations are often scored pass/fail, without grading degrees of correctness or the quality of the model's approach. Existing benchmarks also become less informative as more and more models saturate them; most models now achieve high scores on MMLU and other benchmarks. • Self-reported
MATH (CoT)
0-shot Chain-of-Thought Chain-of-thought (CoT) prompting has the LLM write out the intermediate steps of its reasoning, which improves results on tasks that require reasoning; for assignments that do not require reasoning, CoT usually shows no benefit. In 0-shot CoT the LLM is given no reasoning examples; it is simply prompted to think "step by step" (or with a similar cue). By contrast, in few-shot CoT the model is shown worked reasoning examples before it sees the new task. The method is called "0-shot CoT" because it uses no example reasoning chains, but it still requires a prompt instructing the model to reason step by step. • Self-reported
MBPP ++ base version
0-shot In the 0-shot setting the model is asked to solve tasks directly, without being given examples to learn from. This contrasts with few-shot, where the model sees examples of correct answers to similar tasks before tackling the new problem. 0-shot is one of the hardest scenarios for a model, since it must complete the task with no prior demonstrations and no hints about how to structure the answer. It is also one of the most realistic usage scenarios, since it demands the least effort from the user. The method is often used as a baseline when evaluating model performance because it shows how well the model applies its knowledge in new contexts without extra help; strong 0-shot performance indicates the model acquired a genuine understanding of the task during pretraining. • Self-reported
MMLU (CoT)
0-shot Chain-of-Thought Zero-shot chain-of-thought (0-shot CoT) is a prompting method that asks a language model to break its solution into sequential reasoning steps without being shown any example of what such a chain of reasoning looks like. In the standard setup the model receives the query together with a cue such as "Let's think step by step" before giving its final answer. This lets the model reason step by step, which often leads to more accurate answers, especially on complex, multi-step problems. Unlike few-shot CoT, where the model is shown worked reasoning examples, 0-shot CoT relies on the model's ability to generate the reasoning on its own, which works in modern LLMs because they were trained on a wide variety of reasoning examples and can transfer that skill to new tasks even without task-specific demonstrations. • Self-reported
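A minimal sketch of the 0-shot CoT cue described above; the trigger phrase "Let's think step by step" is the commonly used one and is assumed here:

```python
def build_zero_shot_cot_prompt(question: str) -> str:
    """0-shot CoT: no worked examples, just a cue to reason step by step."""
    return f"Q: {question}\nA: Let's think step by step."

print(build_zero_shot_cot_prompt(
    "A recipe needs 3 eggs per cake. How many eggs do 7 cakes need?"))
```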
MMLU-Pro
5-shot Chain-of-Thought • Self-reported
Multilingual MGSM (CoT)
0-shot Chain-of-Thought • Self-reported
Multipl-E HumanEval
0-shot As our base setting we use 0-shot prompts: we do not show the model example answers to the tasks, we simply ask it directly. For 0-shot questions from GPQA the prompt consists of a short instruction and the question, e.g. "Question: [question]. Answer:". For mathematics the task is phrased as "Solve the following task step by step: [task]". When we allow tool use, we describe in the prompt how the tool can be called. For a calculator, for example: "If you need to perform calculations, place the expression between <calculator></calculator> tags, e.g. <calculator>12*34</calculator>. Do not carry out complex computation yourself; use the tool instead." • Self-reported
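The <calculator></calculator> convention quoted above implies the harness extracts the tagged expression, evaluates it, and splices the result back into the text. A minimal sketch of that step, with the regex and evaluation strategy as illustrative assumptions:

```python
import re

CALC_TAG = re.compile(r"<calculator>(.*?)</calculator>", re.DOTALL)

def run_calculator_calls(model_output: str) -> str:
    """Replace each <calculator>expr</calculator> span with its evaluated result."""
    def _evaluate(match: re.Match) -> str:
        expr = match.group(1)
        # Only accept plain arithmetic; leave anything else untouched.
        if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
            return match.group(0)
        return str(eval(expr))  # illustrative; a real harness would use a safe parser
    return CALC_TAG.sub(_evaluate, model_output)

print(run_calculator_calls("The product is <calculator>12*34</calculator>."))
# -> "The product is 408."
```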
Multipl-E MBPP
0-shot In this setting the model is given only the question, with no examples at all. It must answer directly, without access to demonstrations of the correct way to respond. This is the strictest test of a model's ability to follow instructions, since it has to work out what is required from the query alone. • Self-reported
Nexus
In the zero-shot (0-shot) approach the model works from the query alone. It receives no examples of how to handle the task, cannot lean on previous similar tasks, and has no opportunity to adjust its behavior based on earlier attempts; it must interpret the query and answer using only the capabilities it acquired during pretraining. This is the strictest evaluation setting, since it measures the model's abilities without any additional help or guidance: there are no examples or hints to show it what exactly is expected or how to structure the answer. 0-shot results are usually lower than with other prompting regimes, but they give the most honest picture of a model's basic knowledge and reasoning. • Self-reported
License & Metadata
License
Llama 3.1 Community License
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Llama 3.1 405B Instruct
Meta
405.0B
Best score: 1.0 (ARC)
Released: Jul 2024
Price: $3.50/1M tokens
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
Phi 4 Reasoning Plus
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Hermes 3 70B
Nous Research
70.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.