Key Specifications
Parameters
3.2B
Context
128.0K
Release Date
September 25, 2024
Average Score
55.6%
Timeline
Key dates in the model's history
Announcement
September 25, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
3.2B
Training Tokens
9.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.01
Output (per 1M tokens)
$0.02
Max Input Tokens
128.0K
Max Output Tokens
128.0K
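Cost at the listed rates scales linearly with token counts. A minimal sketch, assuming the rates from the table above ($0.01 per 1M input tokens, $0.02 per 1M output tokens); the function name and defaults are illustrative, not part of any API:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate: float = 0.01, out_rate: float = 0.02) -> float:
    """Return the USD cost of one request at per-1M-token rates
    (rates taken from the pricing table above; assumed current)."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. a 100K-token prompt with a 2K-token completion:
print(f"${estimate_cost(100_000, 2_000):.6f}")  # → $0.001040
```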
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
0-shot, accuracy • Self-reported
MMLU
5-shot, macro_avg/acc • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
8-shot, em_maj1@1
Prompt: We use 8 previous QA pairs as shots in a retrieval setting, retrieving relevant context by embedding similarity. Test accuracy is defined as the majority vote (maj1) of the model's answers over all 8 trials for a single exact match (em), which averages out randomness in the responses. • Self-reported
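The majority-vote exact-match scoring described above can be sketched as follows (a minimal illustration; the function name and answer normalization are assumptions, not the evaluation's actual code):

```python
from collections import Counter

def em_maj1(answers: list[str], gold: str) -> int:
    """Majority-vote exact match: take the most common answer across
    the model's trials and score 1 if it exactly matches the gold answer."""
    majority, _ = Counter(a.strip() for a in answers).most_common(1)[0]
    return int(majority == gold.strip())

# 8 trials for one GSM8k question; 5 of 8 agree on the gold answer "42".
trials = ["42", "42", "41", "42", "42", "40", "42", "41"]
print(em_maj1(trials, "42"))  # → 1
```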
MATH
0-shot, final_em
For each example we send one query to the model and extract the final answer (final_em) from its response once it has fully worked through the solution. Extraction follows three rules: (1) if the answer is stated explicitly (e.g. "The answer is 42"), we take that final answer (here "42"); (2) if the task is multiple choice and the model indicates an option (e.g. "(A)"), we take that option; (3) otherwise we take the final answer from the model's output, and if several numbers appear, we take the last number. • Self-reported
MGSM
CoT (chain of thought), em (exact match) • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
0-shot, accuracy
In this category, we compute the accuracy of the model's predictions on a pre-determined list of questions directly from the model's top-1 output, without any additional prompting or support. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
0-shot, acc
Standard 0-shot evaluation without any additional information or demonstrations. Accuracy is calculated on the evaluation set; because the same 0-shot methodology is used for both evaluation and testing, the two settings match. • Self-reported
BFCL v2
0-shot, accuracy • Self-reported
IFEval
Average of prompt-level and instruction-level accuracy, strict and loose • Self-reported
InfiniteBench/En.MC
0-shot, longbook_choice/acc • Self-reported
InfiniteBench/En.QA
0-shot, longbook_qa/f1 • Self-reported
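The longbook_qa/f1 metric above is a token-overlap F1 between prediction and reference. A minimal sketch using the standard SQuAD-style formulation (assumed here; the benchmark's own normalization steps may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of overlapping tokens (SQuAD-style, assumed)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the red fox", "the fox"), 3))  # → 0.8
```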
Nexus
0-shot, macro_avg/acc • Self-reported
NIH/Multi-needle
0-shot, recall • Self-reported
Open-rewrite
0-shot, micro_avg/rougeL • Self-reported
TLDR9+ (test)
1-shot, rougeL • Self-reported
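Several entries above (Open-rewrite, TLDR9+) use ROUGE-L, the F-measure over the longest common subsequence of tokens. A minimal sketch of the standard metric (illustrative; scoring harnesses typically also apply stemming and tokenization not shown here):

```python
def rouge_l_f(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure via longest common subsequence (LCS) of tokens."""
    p, r = prediction.split(), reference.split()
    # LCS length by dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pt == rt
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(round(rouge_l_f("the cat sat on the mat", "the cat is on the mat"), 3))
```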
License & Metadata
License
llama_3_2_community_license
Announcement Date
September 25, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.1 8B Instruct
Meta
8.0B
Best score:0.8 (ARC)
Released:Jul 2024
Price:$0.20/1M tokens
Llama 3.1 Nemotron Nano 8B V1
NVIDIA
8.0B
Best score:0.5 (GPQA)
Released:Mar 2025
Gemma 2 9B
Google
9.2B
Best score:0.7 (MMLU)
Released:Jun 2024
Ministral 8B Instruct
Mistral AI
8.0B
Best score:0.7 (ARC)
Released:Oct 2024
Price:$0.10/1M tokens
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score:0.8 (ARC)
Released:Aug 2024
Price:$0.10/1M tokens
Phi 4 Mini
Microsoft
3.8B
Best score:0.8 (ARC)
Released:Feb 2025
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score:0.8 (HumanEval)
Released:Sep 2024
Price:$0.30/1M tokens
Qwen2 7B Instruct
Alibaba
7.6B
Best score:0.8 (HumanEval)
Released:Jul 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.