Llama 4 Scout
Multimodal
Llama 4 Scout is a natively multimodal model capable of processing both text and images. It uses a Mixture-of-Experts (MoE) architecture with 17 billion active parameters (109 billion total) and 16 experts, supporting a wide range of multimodal tasks such as conversational interaction, image analysis, and code generation. The model features a 10 million token context window.
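The gap between active and total parameters follows from MoE routing: each token activates only a subset of the experts. Below is a minimal sketch of top-1 routing with toy layer sizes and a simple argmax router; it illustrates the idea only and is not Llama 4's actual architecture.
```
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 64, 16  # toy sizes, not the real model's dimensions

# Each "expert" here is a single weight matrix standing in for a feed-forward block.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route each token to its single highest-scoring expert (top-1 routing)."""
    logits = x @ router              # (tokens, n_experts) routing scores
    chosen = logits.argmax(axis=-1)  # one expert index per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = x[i] @ experts[e]   # only 1 of 16 experts runs per token
    return out

tokens = rng.standard_normal((8, d_model))
print(moe_forward(tokens).shape)  # (8, 64): all experts exist, each token uses one
```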
Key Specifications
Parameters
109.0B
Context
10.0M
Release Date
April 5, 2025
Average Score
67.3%
Timeline
Key dates in the model's history
Announcement
April 5, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
109.0B
Training Tokens
40.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.18
Output (per 1M tokens)
$0.59
Max Input Tokens
10.0M
Max Output Tokens
10.0M
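For a quick cost estimate at these rates, here is a minimal sketch; the token counts are hypothetical, chosen only to illustrate the arithmetic.
```
INPUT_PRICE = 0.18 / 1_000_000   # USD per input token
OUTPUT_PRICE = 0.59 / 1_000_000  # USD per output token

input_tokens, output_tokens = 120_000, 2_000  # hypothetical request
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.4f}")  # $0.0228 for this example request
```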
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot macro_avg/acc_char • Self-reported
Programming
Programming skills tests
MBPP
3-shot pass@1. The model is shown three worked examples per task and must then produce a correct answer on its first attempt; accuracy is the fraction of tasks solved on that single attempt. We call the metric "3-shot pass@1" because the model has access to the examples but gets only one try. It is used for tasks whose correct answer can be verified automatically, for example mathematics or programming problems. • Self-reported
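A minimal sketch of this scoring scheme, assuming a `generate(prompt)` callable that returns one model completion and MBPP-style assert tests per task; the helper names and task format are illustrative, not the official harness.
```
def passes_tests(code, tests):
    """Execute the candidate solution against its assert-based tests.
    NB: exec on model output is unsafe outside a sandbox; fine for a sketch."""
    env = {}
    try:
        exec(code, env)      # define the candidate function
        for t in tests:      # each test is an `assert ...` statement
            exec(t, env)
        return True
    except Exception:
        return False

def pass_at_1(tasks, generate, few_shot_prefix):
    """Fraction of tasks solved by the single first attempt (3-shot prompt)."""
    solved = 0
    for task in tasks:
        completion = generate(few_shot_prefix + task["prompt"])
        solved += passes_tests(completion, task["tests"])
    return solved / len(tasks)
```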
Mathematics
Mathematical problems and computations
MATH
4-shot em_maj1@1. This metric checks whether the model can produce at least one correct answer across several attempts. The procedure: 1. Sample n answers for each question, using different prompts or temperatures; here n = 4. 2. An answer counts as correct if it exactly matches the reference. 3. Accuracy over all tasks counts a task as solved if at least one of its n answers was correct. This "correct at least once" view is useful for understanding what the model can reach across attempts, especially on hard tasks where it sometimes finds the correct solution but not consistently. • Self-reported
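A minimal sketch of that scoring, assuming a `sample(question)` stand-in for one model generation and simple exact-match comparison; the names and the answer extraction are illustrative.
```
def at_least_one_correct(question, gold, sample, n=4):
    """True if any of n sampled answers exactly matches the gold answer."""
    return any(sample(question).strip() == gold.strip() for _ in range(n))

def accuracy(dataset, sample, n=4):
    """Fraction of (question, gold) pairs solved at least once in n attempts."""
    hits = sum(at_least_one_correct(q, a, sample, n) for q, a in dataset)
    return hits / len(dataset)
```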
MGSM
0-shot (average/em) • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
0-shot (accuracy) • Self-reported
Multimodal
Working with images and visual data
ChartQA
# 0-shot CoT
This prompting method targets reasoning tasks. It nudges the LLM to work through a solution "step by step" instead of answering immediately, and it requires no worked step-by-step examples.
## Prompt
```
Q: [task]
A: Let's think step by step.
```
## Analysis
0-shot CoT is a deliberately simple way to improve LLM reasoning. It relies on the fact that modern LLMs follow in-context instructions without examples: appending "Let's think step by step" usually improves results compared with prompting without a reasoning cue, especially on tasks that can be solved through a sequence of logical steps. • Self-reported
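A minimal prompt builder following that template; the example question is illustrative and not from the benchmark.
```
COT_SUFFIX = "\nA: Let's think step by step."

def build_cot_prompt(question):
    # Wrap a raw task in the Q/A template shown above.
    return f"Q: {question}{COT_SUFFIX}"

print(build_cot_prompt("A train travels 120 km in 2 hours. What is its average speed?"))
# Q: A train travels 120 km in 2 hours. What is its average speed?
# A: Let's think step by step.
```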
DocVQA
# 0-shot (ANLS)
Average Normalized Levenshtein Similarity (ANLS) is a metric for document-understanding tasks that measures how close a predicted answer is to the reference answer while tolerating small character-level differences. It converts the Normalized Levenshtein Distance (NLD) into a similarity score:
ANLS(pred, target) = max(0, 1 - NLD(pred, target))
where
NLD(pred, target) = LD(pred, target) / max(|pred|, |target|),
LD is the Levenshtein (edit) distance between prediction and target, and |x| is the length of string x. The per-example ANLS values are then averaged over all examples in the dataset. • Self-reported
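A direct implementation of those formulas, using a standard dynamic-programming Levenshtein distance; the variable names are mine.
```
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred, target):
    """ANLS(pred, target) = max(0, 1 - LD / max(|pred|, |target|))."""
    nld = levenshtein(pred, target) / max(len(pred), len(target), 1)
    return max(0.0, 1.0 - nld)

print(anls("42 dollars", "42 dollar"))  # 0.9: one deletion over length 10
```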
MathVista
0-shot CoT. Chain-of-thought (CoT) is a method for eliciting reasoning from language models. Despite its effectiveness, it works less well in the 0-shot setting, where no reasoning examples are provided. This task checks the model's handling of a 0-shot CoT prompt: whether it can give a step-by-step justification of its answers after the query "Let's think about this step by step", with no examples of such reasoning. We evaluate (1) whether the model produces "chain of thought" reasoning and (2) whether that leads to a correct answer. A prompt of this type tells the model that we want its reasoning process, not only the final answer, much like asking a student to show their work when solving a math problem. With CoT the model exposes its thinking process, which can lead to more accurate answers on complex questions. • Self-reported
MMMU
# 0-shot CoT
Zero-shot Chain-of-Thought (0-shot CoT) is a simple but effective way to improve a language model's reasoning with the prompt "Let's think step by step", without providing any examples. Unlike few-shot CoT, where the model is shown example reasoning chains for various tasks, 0-shot CoT relies exclusively on that simple cue to elicit step-by-step reasoning. Where a plain prompt simply asks the model to solve the problem, 0-shot CoT encourages it to break the solution into steps, which often leads to more accurate results. Despite its simplicity, the method improves mathematical and logical reasoning, and it is especially useful when you have no examples, or when examples for a specific task would be too complex or impractical to provide. • Self-reported
Other Tests
Specialized benchmarks
LiveCodeBench
# 0-shot CoT
0-shot Chain-of-Thought (0-shot CoT) means prompting the model to solve a task step by step, without examples of how to do so. A typical reasoning cue is "Let's think about this step by step". Wei et al. (2022) showed that adding this simple phrase to a query significantly improves model performance on reasoning tasks.
## Application in evaluation
When 0-shot CoT is used for evaluation, the model: 1. receives the task without example solutions; 2. is asked to solve it by breaking it into logical steps; 3. produces its step-by-step reasoning before giving the final answer. This approach often leads to more accurate answers, since the model makes its reasoning explicit, which can surface and correct logical errors. • Self-reported
MMLU-Pro
0-shot (macro_avg/acc) • Self-reported
TydiQA
1-shot average/f1. This method evaluates the model's performance on in-context tasks: in the 1-shot setting the model receives only one example before the actual task. "Average/f1" refers to how results are aggregated: Average is the mean score across all tasks, and F1 is the harmonic mean of precision and recall, a more informative metric for classification-style tasks. The setup is useful for assessing the model's ability to adapt and generalize from limited data: low scores can point to weak few-shot learning, while strong results indicate the model can pick up patterns from a small number of examples. • Self-reported
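A minimal token-level F1 sketch of the kind commonly used for extractive QA scoring; the whitespace tokenization is a simplification, not the benchmark's exact normalization.
```
from collections import Counter

def f1(pred, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    p, g = pred.split(), gold.split()
    common = Counter(p) & Counter(g)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(f1("in the park", "the park"))  # precision 2/3, recall 1.0 -> 0.8
```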
License & Metadata
License
Llama 4 Community License Agreement
Announcement Date
April 5, 2025
Last Updated
July 19, 2025
Similar Models
Llama 4 Maverick
Meta
MM • 400.0B
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.27/1M tokens
Llama 3.2 90B Instruct
Meta
MM • 90.0B
Best score: 0.9 (MMLU)
Released: Sep 2024
Price: $1.20/1M tokens
Llama 3.2 11B Instruct
Meta
MM • 10.6B
Best score: 0.7 (MMLU)
Released: Sep 2024
Price: $0.18/1M tokens
Llama 3.1 8B Instruct
Meta
8.0B
Best score: 0.8 (ARC)
Released: Jul 2024
Price: $0.20/1M tokens
Llama 3.1 405B Instruct
Meta
405.0B
Best score: 1.0 (ARC)
Released: Jul 2024
Price: $3.50/1M tokens
MiniMax M2.5
MiniMax
MM • 230.0B
Released: Feb 2026
Mistral Large 3 (675B Instruct 2512)
Mistral AI
MM • 675.0B
Best score: 0.4 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
GLM-4.5V
Zhipu AI
MM • 108.0B
Released: Aug 2025
Price: $0.60/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.