
Llama 3.3 70B Instruct

Meta

Llama 3.3 is a multilingual large language model optimized for conversational use cases. It is a pre-trained and instruction-tuned generative model with 70 billion parameters that outperforms many open and closed chat models on common industry benchmarks. Llama 3.3 supports a 128,000-token context length and is intended for commercial and research use across multiple languages.

Key Specifications

Parameters
70.0B
Context
128.0K
Release Date
December 6, 2024
Average Score
79.9%

Timeline

Key dates in the model's history
Announcement
December 6, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
70.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.88
Output (per 1M tokens)
$0.88
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
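The listed rates ($0.88 per 1M tokens for both input and output) make request-cost estimation a one-line calculation. A minimal sketch, assuming the prices above stay current:

```python
# Sketch: estimating per-request cost from this page's listed rates
# ($0.88 per 1M input tokens, $0.88 per 1M output tokens -- assumed unchanged).

INPUT_PRICE_PER_M = 0.88
OUTPUT_PRICE_PER_M = 0.88

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 10,000-token prompt with a 2,000-token completion:
print(round(request_cost(10_000, 2_000), 6))  # -> 0.01056
```

Because input and output are priced identically here, only the total token count matters; for models with asymmetric pricing the two terms diverge.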

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
0-shot CoT — Zero-shot Chain-of-Thought prompting has the model break a task into sequential reasoning steps without being shown any examples. Kojima et al. (2022) showed that a language model can reason step by step simply by appending a phrase such as "Let's think step by step" to the query. Unlike few-shot CoT, which supplies worked reasoning examples, 0-shot CoT relies on the trigger prompt alone, which makes it especially useful when examples are unavailable. It is effective for mathematical problems, logical inference, and other tasks that require an explicit chain of thought, and research shows it significantly improves performance on reasoning-heavy tasks without any changes to the model.
Self-reported
86.0%
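The zero-shot CoT setup described above amounts to a one-line prompt transformation. A minimal sketch (`zero_shot_cot_prompt` is an illustrative helper, not part of any library):

```python
# Minimal sketch of zero-shot chain-of-thought prompting (Kojima et al., 2022):
# append a trigger phrase so the model reasons before answering.

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Build a 0-shot CoT prompt: the bare question plus the CoT trigger."""
    return f"{question}\n{COT_TRIGGER}"

prompt = zero_shot_cot_prompt(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
)
print(prompt.endswith(COT_TRIGGER))  # -> True
```

The resulting string would be sent to the chat API as the user message; no examples or system-prompt changes are needed.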

Programming

Programming skills tests
HumanEval
Automatic Temperature Sweeping (ATS) — a method in which, for each question, the model automatically generates several completions at different temperature values: low (0.0–0.3) for accuracy, medium (0.4–0.7) for a balance of accuracy and variety, and high (0.8–1.0) for maximum exploration. The generated answers are then analyzed for agreement, for the most common solutions, and for cases where higher temperature introduces errors; on that basis the model selects or composes a final answer combining the strongest aspects. The method is described as especially effective for mathematical tasks, where low temperature preserves computational accuracy while higher temperature diversifies the solution approach, and as yielding improvements on benchmarks including MATH, GSM8K, and GPQA.
Self-reported
88.4%
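The temperature-sweep idea above can be sketched as sampling at several temperatures and majority-voting over the candidates. This is a sketch under stated assumptions: `generate` is a hypothetical model call, stubbed deterministically here, and real use would substitute an LLM sampling API.

```python
# Sketch of a temperature sweep: sample candidate answers at several
# temperatures, then keep the most common one (self-consistency style vote).
from collections import Counter

TEMPERATURE_BANDS = [0.2, 0.5, 0.9]  # low / mid / high, per the description

def generate(question: str, temperature: float) -> str:
    # Stub standing in for a real LLM sampling call.
    return "42" if temperature < 0.8 else "forty-two"

def temperature_sweep_answer(question: str, samples_per_band: int = 3) -> str:
    """Collect samples across temperature bands and majority-vote."""
    candidates = [
        generate(question, t)
        for t in TEMPERATURE_BANDS
        for _ in range(samples_per_band)
    ]
    return Counter(candidates).most_common(1)[0][0]

print(temperature_sweep_answer("What is 6 * 7?"))  # -> 42
```

With two of the three bands agreeing, the vote resolves to "42"; the description above additionally has the model analyze disagreements rather than vote blindly, which a production version would add.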

Mathematics

Mathematical problems and computations
MATH
0-shot CoT — chain-of-thought prompting without examples lets the model solve complex tasks by generating intermediate reasoning steps. Instead of a prompt that elicits only a final answer, 0-shot CoT includes two key parts: the question or task itself, and a trigger phrase such as "Let's think step by step". This simple addition substantially improves LLM problem solving even without demonstration examples: the model lays out a sequential chain of reasoning, which is especially useful for computation, puzzles, and multi-step reasoning questions. Research shows 0-shot CoT is particularly effective for large modern models such as GPT-4, Claude, and PaLM, and its simplicity makes it accessible to users without special prompting skills or examples, though its effectiveness varies with task complexity and the capabilities of the specific model.
Self-reported
77.0%
MGSM
Chain-of-Thought evaluation — chain-of-thought prompting improves LLM reasoning by eliciting intermediate steps, and those steps carry information beyond the final answer: a model may reach a correct answer through flawed reasoning, or fail despite a broadly sound approach. This evaluation therefore scores the answer and the reasoning separately on reasoning-heavy tasks (e.g. GPQA and sets of competition-level mathematical problems). The answer is judged correct or incorrect, and the reasoning is rated on a 1–5 scale: (1) reasoning has errors that lead to a wrong answer; (2) reasoning has errors but the model reaches the correct answer by luck; (3) reasoning is broadly sound but contains errors that produce a wrong answer; (4) reasoning is broadly correct with minor errors and the answer is correct; (5) reasoning is fully correct and leads to the correct answer. This distinguishes genuinely correct solutions from answers that are right for the wrong reasons.
Self-reported
91.1%
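The 1–5 reasoning rubric above can be written out as a small lookup so graders score the reasoning separately from the final answer. The function and field names are illustrative assumptions, not part of any published harness:

```python
# The 1-5 reasoning rubric as a lookup table, scored separately from
# final-answer correctness (names here are assumed, for illustration).

REASONING_RUBRIC = {
    1: "reasoning has errors that lead to a wrong answer",
    2: "reasoning has errors, but the model reaches the correct answer by luck",
    3: "reasoning is broadly sound but contains errors producing a wrong answer",
    4: "reasoning is broadly correct with minor errors; the answer is correct",
    5: "reasoning is fully correct and leads to the correct answer",
}

def grade(answer_correct: bool, reasoning_score: int) -> str:
    """Combine the two independent judgments into one grading record."""
    assert reasoning_score in REASONING_RUBRIC
    verdict = "correct" if answer_correct else "incorrect"
    return f"answer {verdict}; reasoning {reasoning_score}: {REASONING_RUBRIC[reasoning_score]}"

print(grade(True, 2))  # a lucky correct answer despite flawed reasoning
```

Keeping the two axes separate is what lets the analysis surface "right for the wrong reasons" cases (score 2) that answer-only grading would count as successes.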

Reasoning

Logical reasoning and analysis
GPQA
0-shot CoT — in this method the LLM is asked to follow a chain of thought, laying out its reasoning step by step. Unlike few-shot CoT, where the model is shown how to structure its thoughts, 0-shot CoT has the model determine the structure of the answer itself, typically via a simple prompt such as "let's think step by step". First presented in the paper "Large Language Models are Zero-Shot Reasoners", the method improves performance on a range of reasoning tasks without requiring any examples, and it is valued for its simplicity and efficiency. In practice the model usually restates the task, decomposes it into subtasks, solves them sequentially, and finally combines the results into a final answer.
Self-reported
50.5%

Other Tests

Specialized benchmarks
BFCL v2
Thinking-mode verification — newer LLMs handle complex tasks using "thinking modes": prompts or instructions that direct the model to apply a specific reasoning strategy, such as a programming mode, a mathematical-proof mode, or a deep-analysis mode. Existing benchmarks do not evaluate these modes directly: a new model is typically tested on hard tasks (for example, GPQA for scientific understanding or FrontierMath for mathematics) and scored only on answer correctness, not on how the mode was executed. A verification-based approach instead checks the components of the thinking itself, with three main advantages: (1) when a model fails on expert-level tasks, it attributes the failure to specific components of the mode; (2) as models grow more capable, it tracks whether specific components improve over time; and (3) it pinpoints abilities in specific kinds of thinking rather than only whole-task performance. The methodology: define the mode of interest (for example, programming or mathematics), collect expert examples of executing tasks in that mode, identify its key components (for example, checking edge cases), and build tasks around those components.
Self-reported
77.3%
IFEval
Mathematical-capability comparison — a benchmark of 300+ mathematics tasks spanning difficulty levels from basic arithmetic and number theory up to problems from AIME, mathematics olympiads, FrontierMath, and other competitions. Models solve the tasks using chain-of-thought reasoning without external tools, with three attempts per task; a solution counts as correct only if the final answer is correct, and alternative solution approaches are accepted when they reach it. Reported results: GPT-4o leads, solving about 91% of tasks (more than 270); Claude 3 Opus solves about 68%; larger and newer models (for example, Claude 3 Sonnet) show steady improvements, though all still fall short on the hardest problems.
Self-reported
92.1%
MBPP EvalPlus
Self-assessment first — before measuring a model's ability to solve complex tasks, the model is asked directly whether it can solve the task and why it believes it is or is not capable of doing so. Useful questions include: "Can you solve this task?", "Why do you believe you can (or cannot) solve it?", "What approach would you use?", "What tools or reasoning strategies would help?", and "How confident are you in your solution?". This surfaces underconfidence (the model says it cannot solve a task it actually can), overconfidence (the model says it can solve a task it cannot), and gives a clearer picture of the model's reasoning about its own limitations and of where and why it fails. After the self-assessment, the model attempts the task, and its actual performance is compared with its prediction for a fuller view of its capabilities and limitations.
Self-reported
87.6%
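The self-assessment protocol above can be sketched as eliciting a prediction before the attempt and comparing it with the outcome. `ask_model` is a hypothetical model call, stubbed here so the example runs standalone:

```python
# Sketch of the "ask the model first" protocol: elicit a solvability
# self-assessment, then compare prediction with the actual outcome.
# `ask_model` is a hypothetical LLM call, stubbed for illustration.

def ask_model(prompt: str) -> str:
    return "yes"  # stub: a real model would answer based on the task

def calibration_record(task: str, solved: bool) -> dict:
    """Pair the model's predicted solvability with its actual result."""
    predicted = ask_model(f"Can you solve this task? {task}") == "yes"
    return {
        "task": task,
        "predicted_solvable": predicted,
        "actually_solved": solved,
        # Overconfident: claimed it could solve the task but failed.
        "overconfident": predicted and not solved,
        # Underconfident: denied it could solve the task but succeeded.
        "underconfident": (not predicted) and solved,
    }

rec = calibration_record("Reverse a linked list", solved=False)
print(rec["overconfident"])  # -> True
```

Aggregating such records over a task set gives the calibration picture the description aims for: how often the model's stated confidence matches what it can actually do.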
MMLU-Pro
0-shot CoT — Zero-shot Chain-of-Thought is a method for improving language-model reasoning without the use of examples, first presented in Kojima et al. (2022), "Large Language Models are Zero-Shot Reasoners". A phrase such as "Let's think step by step" is appended to a standard prompt, encouraging the model to make its reasoning process explicit instead of answering immediately. Research showed that this significantly improves performance on tasks requiring reasoning, such as arithmetic and logical-thinking problems, because it pushes the model to break a complex problem into simpler steps it can solve sequentially. Key advantages of 0-shot CoT: it requires no examples or additional training, generalizes across tasks, and works with any sufficiently capable language model. It can be less effective than few-shot CoT (with examples) on especially hard tasks, but its simplicity makes it a practical tool for improving language-model reasoning.
Self-reported
68.9%

License & Metadata

License
Llama 3.3 Community License Agreement
Announcement Date
December 6, 2024
Last Updated
July 19, 2025
