
Llama 4 Scout

Multimodal
Meta

Llama 4 Scout is a natively multimodal model capable of processing both text and images. It uses a Mixture-of-Experts (MoE) architecture with 17 billion active parameters (109 billion total) and 16 experts, supporting a wide range of multimodal tasks such as conversational interaction, image analysis, and code generation. The model features a 10 million token context window.

Key Specifications

Parameters
109.0B
Context
10.0M
Release Date
April 5, 2025
Average Score
67.3%

Timeline

Key dates in the model's history
Announcement
April 5, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
109.0B
Training Tokens
40.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.18
Output (per 1M tokens)
$0.59
Max Input Tokens
10.0M
Max Output Tokens
10.0M
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
5-shot macro_avg/acc_char. Self-reported.
79.6%

Programming

Programming skills tests
MBPP
3-shot pass@1. For tasks where the model must produce a correct answer on its first attempt, accuracy is evaluated with three worked examples provided per task. The metric is called "3-shot pass@1" because the model sees the examples and is scored on a single attempt. It is used for tasks whose correct answer can be verified automatically, such as mathematics or programming problems. Self-reported.
67.8%
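The scoring loop described above can be sketched in a few lines; the model call (`generate`), the verifier (`is_correct`), and the worked examples are hypothetical stand-ins, not part of any official harness:

```python
def pass_at_1(tasks, generate, is_correct, shots):
    """Fraction of tasks solved on the first attempt, with `shots`
    worked examples prepended to each prompt (here: 3-shot pass@1)."""
    solved = 0
    for task in tasks:
        prompt = "\n\n".join(shots) + "\n\n" + task  # prepend the examples
        answer = generate(prompt)                    # single attempt only
        if is_correct(task, answer):
            solved += 1
    return solved / len(tasks)

# Toy run: an "echo" model and a checker that expects the task text back.
score = pass_at_1(
    tasks=["2+2?", "3*3?"],
    generate=lambda p: p.rsplit("\n\n", 1)[-1],
    is_correct=lambda task, ans: ans == task,
    shots=["example 1", "example 2", "example 3"],
)
print(score)  # 1.0
```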

Mathematics

Mathematical problems and computations
MATH
4-shot em_maj1@1. This metric evaluates whether the model can produce at least one correct answer across several attempts. It works as follows: 1. n answers are sampled per question, using different prompts or temperatures (here n=4). 2. An answer counts as correct if it exactly matches the reference. 3. Accuracy is computed over all tasks, counting a task as solved if at least one of the n answers was correct. The method measures the model's ability to give a correct answer "at least once", which is useful for understanding its capability across attempts. This is especially relevant for hard tasks, where the model can sometimes find the correct solution but not consistently. Self-reported.
50.3%
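The "at least one exact match in n samples" rule above reduces to a short aggregation; a minimal sketch, assuming the n samples per task have already been collected:

```python
def any_correct_at_n(samples_per_task, references):
    """Fraction of tasks where at least one of the n sampled answers
    exactly matches the reference answer."""
    hits = sum(
        any(s == ref for s in samples)
        for samples, ref in zip(samples_per_task, references)
    )
    return hits / len(references)

# n = 4 samples per task; task 1 has one exact match, task 2 has none.
score = any_correct_at_n(
    samples_per_task=[["5", "6", "7", "8"], ["a", "b", "c", "d"]],
    references=["7", "e"],
)
print(score)  # 0.5
```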
MGSM
0-shot (average/em). Self-reported.
90.6%

Reasoning

Logical reasoning and analysis
GPQA
0-shot (accuracy). Self-reported.
57.2%

Multimodal

Working with images and visual data
ChartQA
0-shot CoT. This method is applied to reasoning tasks: it was introduced to improve LLMs' mathematical abilities and general reasoning skills by prompting the model to reason "step by step" before answering, rather than answering immediately. Importantly, the method requires no sample step-by-step solutions. The prompt template is: "Q: [task] A: Let's think step by step." 0-shot CoT is a simple method for improving LLM reasoning; it relies on modern LLMs' ability to follow in-context instructions without examples. Appending "Let's think step by step" usually improves results compared with no reasoning prompt, and the method is especially useful for tasks that can be decomposed into logical steps. Self-reported.
88.8%
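The prompt construction itself is trivial, which is much of the method's appeal; a minimal sketch (the example question is illustrative only):

```python
def zero_shot_cot(question: str) -> str:
    """Build a 0-shot chain-of-thought prompt: no worked examples,
    only the instruction to reason step by step."""
    return f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot("What fraction of the bars in the chart exceed 50%?"))
```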
DocVQA
0-shot (ANLS). Average Normalized Levenshtein Similarity (ANLS) is a metric for language-understanding tasks that measures the similarity between the predicted answer and the reference answer while tolerating small differences in characters and words. The normalized Levenshtein distance (NLD) is converted into a similarity score: ANLS(pred, target) = max(0, 1 - NLD(pred, target)), where NLD is the normalized Levenshtein distance: NLD(pred, target) = LD(pred, target) / max(|pred|, |target|). Here LD is the Levenshtein (edit) distance between pred and target, and |x| is the length of x. ANLS is then averaged over all examples in the dataset. Self-reported.
94.4%
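The ANLS formula can be implemented directly from its definition; a sketch using a standard dynamic-programming edit distance (benchmark harnesses may differ in preprocessing such as lowercasing):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance LD(a, b) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, target: str) -> float:
    """ANLS(pred, target) = max(0, 1 - NLD), NLD = LD / max(|pred|, |target|)."""
    if not pred and not target:
        return 1.0
    nld = levenshtein(pred, target) / max(len(pred), len(target))
    return max(0.0, 1.0 - nld)

print(anls("invoice", "invoice"))  # 1.0 (exact match)
print(anls("invoce", "invoice"))   # high similarity: one missing character
```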
MathVista
0-shot CoT. Chain-of-thought (CoT) is a method for eliciting reasoning from language models. Despite its effectiveness, it does not always work well in the 0-shot setting (without reasoning examples). This task checks the model's ability to respond to a 0-shot CoT prompt: whether it can give a step-by-step justification for its answers after the query "Let's think about this step by step", when no examples of such reasoning are provided. We evaluate (1) whether the model produces "chain of thought" reasoning and (2) whether this leads to the correct answer. A prompt like "Let's think about this step by step" asks the model to show its reasoning process rather than only the final answer, much like asking a student to show their work when solving mathematical problems. Using CoT, the model makes its thinking process explicit, which can lead to more accurate answers on complex questions. Self-reported.
70.7%
MMMU
0-shot CoT. Zero-shot Chain-of-Thought (0-shot CoT) is a simple but effective method for improving a language model's reasoning by using the prompt "Let's think step by step", without providing any examples. Unlike few-shot CoT, where the model is shown example reasoning chains for various tasks, 0-shot CoT relies exclusively on this simple prompt to elicit step-by-step reasoning. Whereas a plain query asks the model to solve the problem directly, 0-shot CoT encourages it to break the solution into steps, which often leads to more accurate results. The method is particularly useful for improving mathematical and logical reasoning, especially when no examples are available or when examples for a specific task would be too complex. Self-reported.
69.4%

Other Tests

Specialized benchmarks
LiveCodeBench
0-shot CoT. 0-shot Chain-of-Thought (0-shot CoT) involves prompting the model to solve a task step by step, without examples of how to do so; the typical reasoning trigger is "Let's think about this step by step". Wei et al. (2022) found that adding this simple phrase to a query significantly improves model performance on reasoning tasks, especially mathematical and multi-step reasoning problems. When used for evaluation, the model: 1. receives the task without example solutions; 2. is prompted to solve it by breaking it into logical steps; 3. produces step-by-step reasoning before giving a final answer. This approach often leads to more accurate answers, since the model makes its reasoning explicit, which helps surface and avoid logical errors. Self-reported.
32.8%
MMLU-Pro
0-shot (macro_avg/acc). Self-reported.
74.3%
TydiQA
1-shot average/f1. This metric evaluates the model's effectiveness at in-context question answering: in the 1-shot setting the model receives only one example before the task. "Average/f1" refers to how results are aggregated: Average is the mean score over all tasks, and f1 is the harmonic mean of precision and recall, a metric commonly used for classification-style scoring. The setup is especially useful for assessing the model's ability to adapt and generalize from limited data: low scores on this benchmark can indicate weak few-shot learning ability, while strong results suggest the model can pick up patterns from a small number of examples. Self-reported.
31.5%
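The f1 component described above is the harmonic mean of precision and recall; a token-level sketch (token-overlap F1 is a common QA scoring choice, but the benchmark's exact tokenization is an assumption here):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of precision and recall over shared tokens."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the red fox", "the fox"))  # 0.8
```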

License & Metadata

License
Llama 4 Community License Agreement
Announcement Date
April 5, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.