Llama 4 Scout
Multimodal
Llama 4 Scout is a natively multimodal model capable of processing both text and images. It uses a Mixture-of-Experts (MoE) architecture with 17 billion active parameters (109 billion total) and 16 experts, supporting a wide range of multimodal tasks such as conversational interaction, image analysis, and code generation. The model features a 10 million token context window.
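The gap between active and total parameters follows from MoE routing: each token activates only a subset of the experts. Below is a minimal sketch of top-1 routing with toy layer sizes and a simple argmax router; it illustrates the idea only and is not Llama 4's actual architecture.
```
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 64, 16  # toy sizes, not the real model's dimensions

# Each "expert" here is a single weight matrix standing in for a feed-forward block.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route each token to its single highest-scoring expert (top-1 routing)."""
    logits = x @ router              # (tokens, n_experts) routing scores
    chosen = logits.argmax(axis=-1)  # one expert index per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = x[i] @ experts[e]   # only 1 of 16 experts runs per token
    return out

tokens = rng.standard_normal((8, d_model))
print(moe_forward(tokens).shape)  # (8, 64): all experts exist, each token uses one
```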
Key Specifications
Parameters
109.0B
Context
10.0M
Release Date
April 5, 2025
Average Score
67.3%
Timeline
Key dates in the model's history
Announcement
April 5, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
109.0B
Training Tokens
40.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.18
Output (per 1M tokens)
$0.59
Max Input Tokens
10.0M
Max Output Tokens
10.0M
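For a quick cost estimate at these rates, here is a minimal sketch; the token counts are hypothetical, chosen only to illustrate the arithmetic.
```
INPUT_PRICE = 0.18 / 1_000_000   # USD per input token
OUTPUT_PRICE = 0.59 / 1_000_000  # USD per output token

input_tokens, output_tokens = 120_000, 2_000  # hypothetical request
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.4f}")  # $0.0228 for this example request
```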
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot macro_avg/acc_char • Self-reported
Programming
Programming skills tests
MBPP
3-shot pass@1. The model is shown three worked examples per task and must then produce a correct answer on its first attempt; accuracy is the fraction of tasks solved on that single attempt. We call the metric "3-shot pass@1" because the model has access to the examples but gets only one try. It is used for tasks whose correct answer can be verified automatically, for example mathematics or programming problems. • Self-reported
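A minimal sketch of this scoring scheme, assuming a `generate(prompt)` callable that returns one model completion and MBPP-style assert tests per task; the helper names and task format are illustrative, not the official harness.
```
def passes_tests(code, tests):
    """Execute the candidate solution against its assert-based tests.
    NB: exec on model output is unsafe outside a sandbox; fine for a sketch."""
    env = {}
    try:
        exec(code, env)      # define the candidate function
        for t in tests:      # each test is an `assert ...` statement
            exec(t, env)
        return True
    except Exception:
        return False

def pass_at_1(tasks, generate, few_shot_prefix):
    """Fraction of tasks solved by the single first attempt (3-shot prompt)."""
    solved = 0
    for task in tasks:
        completion = generate(few_shot_prefix + task["prompt"])
        solved += passes_tests(completion, task["tests"])
    return solved / len(tasks)
```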
Mathematics
Mathematical problems and computations
MATH
4-shot em_maj1@1. This metric checks whether the model can produce at least one correct answer across several attempts. The procedure: 1. Sample n answers for each question, using different prompts or temperatures; here n = 4. 2. An answer counts as correct if it exactly matches the reference. 3. Accuracy over all tasks counts a task as solved if at least one of its n answers was correct. This "correct at least once" view is useful for understanding what the model can reach across attempts, especially on hard tasks where it sometimes finds the correct solution but not consistently. • Self-reported
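A minimal sketch of that scoring, assuming a `sample(question)` stand-in for one model generation and simple exact-match comparison; the names and the answer extraction are illustrative.
```
def at_least_one_correct(question, gold, sample, n=4):
    """True if any of n sampled answers exactly matches the gold answer."""
    return any(sample(question).strip() == gold.strip() for _ in range(n))

def accuracy(dataset, sample, n=4):
    """Fraction of (question, gold) pairs solved at least once in n attempts."""
    hits = sum(at_least_one_correct(q, a, sample, n) for q, a in dataset)
    return hits / len(dataset)
```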
MGSM
0-shot (average/em) • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
0-shot (accuracy) • Self-reported
Multimodal
Working with images and visual data
ChartQA
# 0-shot CoT
This prompting method targets reasoning tasks. It nudges the LLM to work through a solution "step by step" instead of answering immediately, and it requires no worked step-by-step examples.
## Prompt
```
Q: [task]
A: Let's think step by step.
```
## Analysis
0-shot CoT is a deliberately simple way to improve LLM reasoning. It relies on the fact that modern LLMs follow in-context instructions without examples: appending "Let's think step by step" usually improves results compared with prompting without a reasoning cue, especially on tasks that can be solved through a sequence of logical steps. • Self-reported
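A minimal prompt builder following that template; the example question is illustrative and not from the benchmark.
```
COT_SUFFIX = "\nA: Let's think step by step."

def build_cot_prompt(question):
    # Wrap a raw task in the Q/A template shown above.
    return f"Q: {question}{COT_SUFFIX}"

print(build_cot_prompt("A train travels 120 km in 2 hours. What is its average speed?"))
# Q: A train travels 120 km in 2 hours. What is its average speed?
# A: Let's think step by step.
```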
DocVQA
# 0-shot (ANLS)
Average Normalized Levenshtein Similarity (ANLS) is a metric for document-understanding tasks that measures how close a predicted answer is to the reference answer while tolerating small character-level differences. It converts the Normalized Levenshtein Distance (NLD) into a similarity score:
ANLS(pred, target) = max(0, 1 - NLD(pred, target))
where
NLD(pred, target) = LD(pred, target) / max(|pred|, |target|),
LD is the Levenshtein (edit) distance between prediction and target, and |x| is the length of string x. The per-example ANLS values are then averaged over all examples in the dataset. • Self-reported
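A direct implementation of those formulas, using a standard dynamic-programming Levenshtein distance; the variable names are mine.
```
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred, target):
    """ANLS(pred, target) = max(0, 1 - LD / max(|pred|, |target|))."""
    nld = levenshtein(pred, target) / max(len(pred), len(target), 1)
    return max(0.0, 1.0 - nld)

print(anls("42 dollars", "42 dollar"))  # 0.9: one deletion over length 10
```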
MathVista
0-shot CoT. Chain-of-thought (CoT) is a method for eliciting reasoning from language models. Despite its effectiveness, it works less well in the 0-shot setting, where no reasoning examples are provided. This task checks the model's handling of a 0-shot CoT prompt: whether it can give a step-by-step justification of its answers after the query "Let's think about this step by step", with no examples of such reasoning. We evaluate (1) whether the model produces "chain of thought" reasoning and (2) whether that leads to a correct answer. A prompt of this type tells the model that we want its reasoning process, not only the final answer, much like asking a student to show their work when solving a math problem. With CoT the model exposes its thinking process, which can lead to more accurate answers on complex questions. • Self-reported
MMMU
# 0-shot CoT
Zero-shot Chain-of-Thought (0-shot CoT) is a simple but effective way to improve a language model's reasoning with the prompt "Let's think step by step", without providing any examples. Unlike few-shot CoT, where the model is shown example reasoning chains for various tasks, 0-shot CoT relies exclusively on that simple cue to elicit step-by-step reasoning. Where a plain prompt simply asks the model to solve the problem, 0-shot CoT encourages it to break the solution into steps, which often leads to more accurate results. Despite its simplicity, the method improves mathematical and logical reasoning, and it is especially useful when you have no examples, or when examples for a specific task would be too complex or impractical to provide. • Self-reported
Other Tests
Specialized benchmarks
LiveCodeBench
# 0-shot CoT
0-shot Chain-of-Thought (0-shot CoT) means prompting the model to solve a task step by step, without examples of how to do so. A typical reasoning cue is "Let's think about this step by step". Wei et al. (2022) showed that adding this simple phrase to a query significantly improves model performance on reasoning tasks.
## Application in evaluation
When 0-shot CoT is used for evaluation, the model: 1. receives the task without example solutions; 2. is asked to solve it by breaking it into logical steps; 3. produces its step-by-step reasoning before giving the final answer. This approach often leads to more accurate answers, since the model makes its reasoning explicit, which can surface and correct logical errors. • Self-reported
MMLU-Pro
0-shot (macro_avg/acc) • Self-reported
TydiQA
1-shot average/f1. This method evaluates the model's performance on in-context tasks: in the 1-shot setting the model receives only one example before the actual task. "Average/f1" refers to how results are aggregated: Average is the mean score across all tasks, and F1 is the harmonic mean of precision and recall, a more informative metric for classification-style tasks. The setup is useful for assessing the model's ability to adapt and generalize from limited data: low scores can point to weak few-shot learning, while strong results indicate the model can pick up patterns from a small number of examples. • Self-reported
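A minimal token-level F1 sketch of the kind commonly used for extractive QA scoring; the whitespace tokenization is a simplification, not the benchmark's exact normalization.
```
from collections import Counter

def f1(pred, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    p, g = pred.split(), gold.split()
    common = Counter(p) & Counter(g)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(f1("in the park", "the park"))  # precision 2/3, recall 1.0 -> 0.8
```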
License & Metadata
License
Llama 4 Community License Agreement
Announcement Date
April 5, 2025
Last Updated
July 19, 2025
Similar Models
Llama 4 Maverick
Meta
MM • 400.0B
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.27/1M tokens
Llama 3.2 90B Instruct
Meta
MM • 90.0B
Best score: 0.9 (MMLU)
Released: Sep 2024
Price: $1.20/1M tokens
Llama 3.2 11B Instruct
Meta
MM • 10.6B
Best score: 0.7 (MMLU)
Released: Sep 2024
Price: $0.18/1M tokens
Llama 3.1 8B Instruct
Meta
8.0B
Best score: 0.8 (ARC)
Released: Jul 2024
Price: $0.20/1M tokens
Llama 3.1 405B Instruct
Meta
405.0B
Best score: 1.0 (ARC)
Released: Jul 2024
Price: $3.50/1M tokens
MiniMax M2.5
MiniMax
MM • 230.0B
Released: Feb 2026
Mistral Large 3 (675B Instruct 2512)
Mistral AI
MM • 675.0B
Best score: 0.4 (GPQA)
Released: Dec 2025
Price: $0.50/1M tokens
GLM-4.5V
Zhipu AI
MM • 108.0B
Released: Aug 2025
Price: $0.60/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.