Key Specifications
Parameters
70.0B
Context
128.0K
Release Date
July 23, 2024
Average Score
74.7%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
70.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.89
Output (per 1M tokens)
$0.89
Max Input Tokens
128.0K
Max Output Tokens
128.0K
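At the listed rates, the cost of a request is simply the token count divided by one million, times the per-direction price. A minimal sketch of that arithmetic, assuming the $0.89/1M input and output prices above (the helper name is illustrative):

```python
# Rough cost estimate at the listed per-1M-token rates (illustrative sketch).
INPUT_PRICE_PER_1M = 0.89   # USD per 1M input tokens
OUTPUT_PRICE_PER_1M = 0.89  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request at the listed prices."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

# Example: a 4,000-token prompt with a 1,000-token completion costs about $0.00445
print(f"${estimate_cost_usd(4_000, 1_000):.5f}")
```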
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
Models such as GPT-4 generate plausible-looking answers, but those answers can contain errors of fact or reasoning that are hard to catch without domain knowledge or verification. One approach is to present the model with an answer in advance and ask it to explain why it is incorrect: if the model identifies the mistake and gives a correct explanation, that speaks to its ability to evaluate information critically and correct errors; if it treats the incorrect answer as correct or tries to justify it, that points to gaps in its understanding or its training. This kind of probing is especially useful in fields where correct answers are known and verifiable, for example mathematics or factual questions. • Self-reported
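A minimal sketch of how such a probe might be phrased, presenting a deliberately wrong answer and asking the model to critique it; the prompt wording and helper name are illustrative assumptions, not part of the benchmark itself:

```python
def build_verification_prompt(question: str, proposed_answer: str) -> str:
    """Ask the model to judge a (possibly wrong) answer instead of producing one."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n\n"
        "Is the proposed answer correct? If it is wrong, explain the error "
        "and give the correct answer."
    )

# Probe with a deliberately incorrect answer (17 * 24 is actually 408).
print(build_verification_prompt("What is 17 * 24?", "398"))
```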
Reasoning
Logical reasoning and analysis
DROP
We evaluate the model's ability to answer a question directly, with no examples, no task-specific instructions, and no additional context, which measures its baseline capability. In this setup the model receives either the question or task alone, or the question together with a note about the expected answer format. The mode checks how well the model understands and solves tasks relying only on its pretrained knowledge, which is especially important for judging whether it can interpret a task correctly without extra prompts or examples. • Self-reported
GPQA
0-shot In AI evaluation, "0-shot" (zero-shot) refers to measuring a model's ability to perform a task without any examples or task-specific instructions. The model is judged purely on how well it applies its general training to the new task, with no demonstrations. For example, to measure an LLM's 0-shot ability we can ask it to solve a task it has not been shown how to solve, providing no sample solutions. 0-shot is often contrasted with few-shot approaches, where the model is given several examples before the task. It is especially important for gauging a model's generalization and its grasp of the domain. • Self-reported
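A minimal sketch of what the 0-shot setup described above looks like in practice, with a few-shot variant for contrast; the prompt wording and helper names are illustrative assumptions:

```python
def build_zero_shot_prompt(question: str) -> str:
    """0-shot: only the question and a format hint, no worked examples."""
    return f"Question: {question}\nAnswer:"

def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Few-shot, for contrast: prepend worked (question, answer) pairs."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

print(build_zero_shot_prompt("Which element has the chemical symbol Fe?"))
```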
Other Tests
Specialized benchmarks
API-Bank
In 0-shot testing the model receives no worked examples of the task or its results. It must rely exclusively on knowledge acquired during pretraining to form its answer. This evaluation method shows how well the model transfers its knowledge to tasks it was never explicitly shown during training. • Self-reported
ARC-C
0-shot For this evaluation the model receives the task without any examples or additional information and must complete it using only its pretrained knowledge. Unlike few-shot setups, where the model can adapt based on a handful of demonstrations, in the 0-shot setting it relies entirely on what it learned during training. This approach demonstrates the model's ability to generalize and apply its knowledge to new tasks without extra guidance; 0-shot scores are often used as a check of a model's basic capabilities and a rough proxy for general competence and generalization. • Self-reported
BFCL
Self-reported
Gorilla Benchmark API Bench
0-shot The zero-shot method means the task is given without any examples of how to solve it. The model works from the prompt instructions alone and must figure out on its own how to carry out the assignment. This is the hardest setting for the model, since it receives no additional context and no examples of similar tasks being solved; it relies exclusively on knowledge acquired during pretraining and on the query itself. The method is often used to assess a model's basic ability to understand and solve tasks without extra help. • Self-reported
GSM-8K (CoT)
8-shot Chain-of-Thought 8-shot Chain-of-Thought (CoT) prompting has the model work through reasoning drawn from several worked examples before answering the target question. The examples (usually about 8) each contain both a question and the step-by-step reasoning that leads to its answer, demonstrating how to break a complex problem into a sequence of intermediate steps. When the LLM is then presented with a new question after these examples, it reproduces the same kind of reasoning sequence before giving its answer. The method is particularly effective for tasks that require complex reasoning, such as math word problems, logic puzzles, and inference. The appeal of 8-shot CoT is that it needs no explicit instructions about how to reason: the model picks up the pattern from the examples, so step-by-step thinking can be applied without crafting specialized prompts for every task type. • Self-reported
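A minimal sketch of assembling such a CoT prompt from worked examples; the exemplars and wording are illustrative, not the actual GSM-8K harness, and only two of the eight shots are written out:

```python
# Each exemplar pairs a question with its full step-by-step solution.
EXEMPLARS = [
    ("Tom has 3 boxes of 12 pencils. How many pencils does he have?",
     "Each box holds 12 pencils and there are 3 boxes. 3 * 12 = 36. The answer is 36."),
    ("A shirt costs $20 and is discounted by 25%. What is the new price?",
     "25% of 20 is 5. 20 - 5 = 15. The answer is 15."),
    # ...in the 8-shot setting, six more worked examples would follow here...
]

def build_cot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    """Concatenate the worked reasoning examples, then append the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_cot_prompt(EXEMPLARS, "A train travels 60 km/h for 2.5 hours. How far does it go?"))
```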
IFEval
Standard AI evaluations and other benchmark tests are used to measure how well models perform various tasks and to compare them with other models. Important as they are, these evaluations have several limitations. First, they usually score only the model's final answer, not how it got there. For a problem like 97 × 98, a model such as Claude may reach the correct answer (9506) via a working like 97 × 98 = 97 × 100 - 97 × 2 = 9700 - 194 = 9506, but grading the final answer alone says nothing about whether the reasoning was sound; analyzing the intermediate steps is what reveals how and why a model makes errors. Second, most evaluations use the base model and do not let it draw on capabilities such as tools or extended thinking, and answers are typically taken as-is, with no opportunity for follow-up queries if an answer is ambiguous. Third, standard evaluations are often scored pass/fail, without grading degrees of correctness or the quality of the model's approach. Existing benchmarks also become less informative as more and more models saturate them; most models now achieve high scores on MMLU and other benchmarks. • Self-reported
MATH (CoT)
0-shot Chain-of-Thought Chain-of-thought (CoT) prompting has the LLM write out the intermediate steps of its reasoning, which improves results on tasks that require reasoning; for assignments that do not require reasoning, CoT usually shows no benefit. In 0-shot CoT the LLM is given no reasoning examples; it is simply prompted to think "step by step" (or with a similar cue). By contrast, in few-shot CoT the model is shown worked reasoning examples before it sees the new task. The method is called "0-shot CoT" because it uses no example reasoning chains, but it still requires a prompt instructing the model to reason step by step. • Self-reported
MBPP ++ base version
0-shot In the 0-shot setting the model is asked to solve tasks directly, without being given examples to learn from. This contrasts with few-shot, where the model sees examples of correct answers to similar tasks before tackling the new problem. 0-shot is one of the hardest scenarios for a model, since it must complete the task with no prior demonstrations and no hints about how to structure the answer. It is also one of the most realistic usage scenarios, since it demands the least effort from the user. The method is often used as a baseline when evaluating model performance because it shows how well the model applies its knowledge in new contexts without extra help; strong 0-shot performance indicates the model acquired a genuine understanding of the task during pretraining. • Self-reported
MMLU (CoT)
0-shot Chain-of-Thought Zero-shot chain-of-thought (0-shot CoT) is a prompting method that asks a language model to break its solution into sequential reasoning steps without being shown any example of what such a chain of reasoning looks like. In the standard setup the model receives the query together with a cue such as "Let's think step by step" before giving its final answer. This lets the model reason step by step, which often leads to more accurate answers, especially on complex, multi-step problems. Unlike few-shot CoT, where the model is shown worked reasoning examples, 0-shot CoT relies on the model's ability to generate the reasoning on its own, which works in modern LLMs because they were trained on a wide variety of reasoning examples and can transfer that skill to new tasks even without task-specific demonstrations. • Self-reported
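A minimal sketch of the 0-shot CoT cue described above; the trigger phrase "Let's think step by step" is the commonly used one and is assumed here:

```python
def build_zero_shot_cot_prompt(question: str) -> str:
    """0-shot CoT: no worked examples, just a cue to reason step by step."""
    return f"Q: {question}\nA: Let's think step by step."

print(build_zero_shot_cot_prompt(
    "A recipe needs 3 eggs per cake. How many eggs do 7 cakes need?"))
```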
MMLU-Pro
5-shot Chain-of-Thought • Self-reported
Multilingual MGSM (CoT)
0-shot Chain-of-Thought • Self-reported
Multipl-E HumanEval
0-shot As our base setting we use 0-shot prompts: we do not show the model example answers to the tasks, we simply ask it directly. For 0-shot questions from GPQA the prompt consists of a short instruction and the question, e.g. "Question: [question]. Answer:". For mathematics the task is phrased as "Solve the following task step by step: [task]". When we allow tool use, we describe in the prompt how the tool can be called. For a calculator, for example: "If you need to perform calculations, place the expression between <calculator></calculator> tags, e.g. <calculator>12*34</calculator>. Do not carry out complex computation yourself; use the tool instead." • Self-reported
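The <calculator></calculator> convention quoted above implies the harness extracts the tagged expression, evaluates it, and splices the result back into the text. A minimal sketch of that step, with the regex and evaluation strategy as illustrative assumptions:

```python
import re

CALC_TAG = re.compile(r"<calculator>(.*?)</calculator>", re.DOTALL)

def run_calculator_calls(model_output: str) -> str:
    """Replace each <calculator>expr</calculator> span with its evaluated result."""
    def _evaluate(match: re.Match) -> str:
        expr = match.group(1)
        # Only accept plain arithmetic; leave anything else untouched.
        if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
            return match.group(0)
        return str(eval(expr))  # illustrative; a real harness would use a safe parser
    return CALC_TAG.sub(_evaluate, model_output)

print(run_calculator_calls("The product is <calculator>12*34</calculator>."))
# -> "The product is 408."
```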
Multipl-E MBPP
0-shot In this setting the model is given only the question, with no examples at all. It must answer directly, without access to demonstrations of the correct way to respond. This is the strictest test of a model's ability to follow instructions, since it has to work out what is required from the query alone. • Self-reported
Nexus
In the zero-shot (0-shot) approach the model works from the query alone. It receives no examples of how to handle the task, cannot lean on previous similar tasks, and has no opportunity to adjust its behavior based on earlier attempts; it must interpret the query and answer using only the capabilities it acquired during pretraining. This is the strictest evaluation setting, since it measures the model's abilities without any additional help or guidance: there are no examples or hints to show it what exactly is expected or how to structure the answer. 0-shot results are usually lower than with other prompting regimes, but they give the most honest picture of a model's basic knowledge and reasoning. • Self-reported
License & Metadata
License
Llama 3.1 Community License
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Llama 3.1 405B Instruct
Meta
405.0B
Best score: 1.0 (ARC)
Released: Jul 2024
Price: $3.50/1M tokens
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
Phi 4 Reasoning Plus
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Hermes 3 70B
Nous Research
70.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.