Llama 3.2 90B Instruct
Multimodal
Llama 3.2 90B is a large multimodal language model optimized for visual recognition, image reasoning, and captioning tasks. It supports a context length of 128,000 tokens and delivers state-of-the-art performance in image understanding and generative tasks.
Key Specifications
Parameters
90.0B
Context
128.0K
Release Date
September 25, 2024
Average Score
71.3%
Timeline
Key dates in the model's history
Announcement
September 25, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
90.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal · ZeroEval
Pricing & Availability
Input (per 1M tokens)
$1.20
Output (per 1M tokens)
$1.20
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
0-shot CoT — Chain-of-thought prompting is a method in which the model generates step-by-step reasoning before giving its final answer. In the 0-shot variant, the model produces the reasoning chain without being shown any worked examples; it is usually triggered by an instruction such as "Let's think step by step" or "Let's solve this problem step by step." The method suits tasks that require complex reasoning, such as mathematical problems, logic puzzles, and multi-step decision making: it helps the model break a hard task into manageable subtasks, improving the accuracy and transparency of its reasoning. Unlike few-shot CoT, which demonstrates example reasoning chains, 0-shot CoT relies on the model's internal reasoning ability, which makes it more broadly applicable across task types. • Self-reported
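The 0-shot CoT setup described above amounts to a simple prompt wrapper. A minimal sketch (the function name and exact trigger phrasing here are illustrative, not the harness actually used for these scores):

```python
def zero_shot_cot_prompt(question: str) -> str:
    """Wrap a question with the standard zero-shot CoT trigger.

    No worked examples are included: the trailing instruction alone
    cues the model to emit step-by-step reasoning before its answer.
    """
    return (
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

prompt = zero_shot_cot_prompt(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
print(prompt)
```

The same wrapper works for any question, which is the practical advantage over few-shot CoT: no per-task example curation is needed.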
Mathematics
Mathematical problems and computations
MATH
0-shot CoT — Zero-shot chain-of-thought (0-shot CoT) is an approach that encourages a language model to lay out its line of reasoning when answering complex questions. Instead of producing an answer directly, the model generates a chain of reasoning before the final answer. The technique builds on the research "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022). It differs from standard CoT in that it requires no example reasoning chains; instead it uses an instruction such as "Let's think step by step" to stimulate the reasoning process. Advantages of 0-shot CoT: no example demonstrations need to be written; it transfers more easily across contexts; it is less tied to specific examples; it can be applied to new task types. It is especially effective on mathematical problems, logic puzzles, and other multi-step reasoning tasks; research shows that adding the simple prompt "Let's think step by step" can significantly improve LLM answer accuracy in these areas. However, 0-shot CoT can underperform few-shot CoT on very hard tasks or in specialized fields where the model benefits from concrete example reasoning. • Self-reported
MGSM
0-shot CoT — Chain-of-Thought (CoT) without examples (0-shot) is a technique in which the model (LLM) produces step-by-step reasoning before giving its answer, without relying on any examples. It is prompted by a cue such as "step by step," which encourages the model to break a complex task into intermediate reasoning steps before answering. The method significantly improves LLM performance on tasks that require complex reasoning, such as logical and mathematical problems. It lets the model lay out its chain of reasoning, which is especially useful when worked examples are unavailable or hard to construct. Although performance can be lower than with few-shot CoT, the method requires no carefully crafted examples, making it more broadly applicable and less sensitive to example selection. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
0-shot CoT — In 0-shot CoT, the model uses step-by-step reasoning to solve tasks without being shown example reasoning chains. The methodology was first described in the work of Kojima et al. (2022), who discovered that adding a phrase like "Let's think step by step" to a query can significantly improve the performance of large language models on reasoning tasks. Several variants of the instruction were evaluated, for example: "Let's think step by step."; "Let's work through this problem step by step."; "Let's solve this task step by step." Testing showed that "Let's solve this task step by step" usually achieved the best results, so that instruction is used in the main evaluation. • Self-reported
Multimodal
Working with images and visual data
AI2D
# Analysis of OpenAI o-1 answers on mathematics tasks
## Overview
This analysis examines the effectiveness of OpenAI o-1, in tool-use mode, on problems from AIME, FrontierMath, and the Harvard-MIT Mathematics Tournament, comparing o-1 against GPT-4 and Claude 3 Opus. The results show a clear improvement over prior models: o-1 outperforms GPT-4 on all data sets and Claude 3 Opus on two of three. The analysis also covers o-1's solutions, its strengths, and its limitations.
## Setup
OpenAI describes o-1 as a new model that "significantly improves reasoning capabilities and code quality" compared with GPT-4. o-1 was evaluated with Python tools, GPT-4 with Code Interpreter, and Claude 3 Opus with Claude's tools; all models could use Python for assistance. Even when GPT-4 was run without Code Interpreter, it behaved as though it had tool access, so tool access was provided explicitly for all models. • Self-reported
ChartQA
# Logical reasoning across difficulty levels
## Purpose and general conclusions
The purpose of this test was to evaluate the model's logical reasoning across levels of difficulty. Each task was carefully designed to probe specific aspects of reasoning, including logic, option analysis, and solution verification. Since logical problems usually require multi-step reasoning, they are well suited to evaluating a model's abilities; tasks ranged from simple logical problems to ones requiring longer inference chains and several interacting constraints. Conclusions: the model shows solid reasoning ability at basic and intermediate difficulty, and is strong at analyzing options and verifying solutions. On the hardest tasks the model sometimes makes errors, especially when reasoning involves several interacting constraints, but on the whole it solves the majority correctly.
## Methodology
The test set consisted of 12 logical problems of varying difficulty, in three categories:
1. **Basic logical problems (4)** - simple reasoning and option analysis
2. **Intermediate logical problems (4)** - longer chains and more complex structure
3. **Advanced logical problems (4)** - complex reasoning with several constraints
Each task was scored on the following criteria:
- Correctness of the final answer
- Quality of the logical reasoning
- Ability to track and apply all constraints
- Verification of the solution where necessary
## Results by category
### Basic logical problems
The model reasoned correctly through all 4 basic problems, working sequentially. • Self-reported
DocVQA
# Token-count verification
## Definition and application
**Stop Token Counting (STC)** evaluates a model's ability to respect length limits in its answers. The model is asked a question and told that its answer must consist of a given number of tokens (words, sentences, or characters). In the context of large language models (LLMs), a **token** is a word or sequence of characters that serves as the unit of text generation. STC verifies whether the model can accurately monitor its own output length and stop on target.
## Methodology
### General structure of the test
1. The model is given an instruction to answer a question using an exact number of tokens (for example, "Answer this question using 20 words").
2. The number of tokens in the model's answer is counted.
3. Accuracy is determined.
### Counting units
* **Words**: based on the number of words in the answer.
* **Sentences**: based on the number of sentences.
* **Characters**: based on the number of characters, including punctuation.
* **Tokens**: based on the number of tokens under a specific tokenizer (the most difficult variant).
## What STC measures
1. **Self-monitoring**: can the model track its generation as it goes and stop at the limit?
2. **Counting accuracy**: can the model correctly count units of language?
3. **Instruction following**: how well does the model honor explicit length limits?
## Example instructions
* "Answer using 15 words."
* "Explain how this works in 3 sentences. No more and no fewer."
* "Explain this using 100 characters, including spaces."
## Scoring
* **Exact match**: the model hits the target exactly.
* **Tolerance band**: the model lands within a small deviation from the target. • Self-reported
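A length-constraint check of the kind this benchmark entry describes can be sketched as follows (the helper names and the naive sentence-splitting rule are assumptions; the actual benchmark's counting rules are not specified here):

```python
import re

def count_units(answer: str, unit: str) -> int:
    """Count words, sentences, or characters in a model answer."""
    if unit == "words":
        return len(answer.split())
    if unit == "sentences":
        # Naive split on sentence-ending punctuation.
        return len([s for s in re.split(r"[.!?]+", answer) if s.strip()])
    if unit == "characters":
        return len(answer)
    raise ValueError(f"unknown unit: {unit}")

def meets_constraint(answer: str, target: int, unit: str, tolerance: int = 0) -> bool:
    """True if the answer's length is within `tolerance` of the target."""
    return abs(count_units(answer, unit) - target) <= tolerance

# Instruction was: "Answer using 15 words."
answer = "The capital of France is Paris, a city known for art, history, and fine cuisine."
print(count_units(answer, "words"))                    # 15
print(meets_constraint(answer, 15, "words"))           # True
```

Exact-match scoring corresponds to `tolerance=0`; a tolerance band simply widens the accepted interval.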
MathVista
# Scratchpad prompting
The "scratchpad" technique allows the model to answer a complex question step by step, recording its chain of reasoning and intermediate results. It addresses a failure mode of long reasoning, where the model can lose or corrupt intermediate results. Using a scratchpad, the model preserves key information across the overall reasoning process. The technique demonstrates improved performance on tasks requiring complex reasoning, such as mathematical problems from AIME.
## How it works
1. The model is given a "scratchpad" — a region in the query where it can record intermediate results.
2. Instead of holding the whole reasoning chain "in its head" (in the model's context), the model writes key intermediate results to the scratchpad.
3. While solving, the model can refer back to the scratchpad to retrieve information it recorded earlier.
4. The scratchpad is updated as the model derives new intermediate results.
## Example use
```
Task: [problem statement]
Scratchpad: [thinking, which can be long and complex]
Intermediate results:
1. [result 1]
2. [result 2]
...
Answer: [final answer]
```
The model appends new intermediate results or revises existing ones.
## Why this works
- **Focus**: the model does not try to hold all aspects of the solution "in its head".
- **Lower error probability**: key intermediate results are recorded rather than resting on the model's memory.
- **Structured approach**: the model builds the solution step by step and can track progress.
- **Transparency**: the reasoning process is more visible, which makes it easier to detect errors.
## Application
This technique is especially useful for multi-step mathematical tasks. • Self-reported
MMMU
0-shot CoT — This method encourages the model to think sequentially before giving its final answer. Unlike few-shot CoT, it requires no example reasoning; instead the model is given an instruction of the form "let's think step by step" or "let's solve this task step by step" before the question. This simple approach prompts the model to generate intermediate reasoning, which often leads to more accurate answers, especially on complex tasks such as logic or reasoning problems. 0-shot CoT is more economical than few-shot CoT, since it requires no examples for each task type; however, the quality of the reasoning can vary depending on the model's capabilities and the complexity of the task. • Self-reported
Other Tests
Specialized benchmarks
InfographicsQA
# Negative prompting for language models
This entry describes one method for improving language model (LLM) answers: negative prompting, a technique in which the prompt tells the model what it should NOT do when generating an answer. Since a language model is trained on broad data, including low-quality information, it can sometimes generate undesirable answers. One class of solutions is RLHF (reinforcement learning from human feedback), which tunes the model on the basis of human preference signals. There are also simpler methods, such as negative prompting, where you tell the model which types of answers to avoid. This can work effectively when you want the model to avoid long answers, specific errors, or particular phrasings.
## Examples
Negative prompting is especially useful when the model over-explains, hedges, or pads its answers. Example constraints that can be added to queries to improve the model's answers:
- "Do not answer with a disclaimer."
- "Do not mention that you are an AI assistant."
- "Please give answers without explanations."
- "Keep answers short."
- "Do not use phrases like 'I cannot answer this question' or 'I'm sorry, I can't help'."
- "Avoid long and repetitive answers."
- "Do not refuse to answer because of ambiguity in the question."
Where possible, instructions should be specific. For example, instead of "Do not be too verbose," say "Keep your answer under 100 words."
## Limitations of negative prompting
Although negative prompting can be useful, it has limits. • Self-reported
MMMU-Pro
0-shot CoT — A method in which the model is asked to solve a task with a prompt such as "Let's solve this step by step," to encourage it to show its line of reasoning. This is a way to improve LLM performance without providing example reasoning. The approach works both for common-sense tasks and for harder mathematical problems, and serves as an alternative to reasoning chains demonstrated with examples (few-shot CoT). • Self-reported
TextVQA
Several lines of research in this field have focused on efficient algorithms for hard number-theoretic problems, among them the Number Field Sieve (NFS) for integer factorization and the Function Field Sieve (FFS) for discrete logarithms over finite fields. These achievements were not only theoretical: modern methods bound the practical complexity of these problems, which underpins the security of RSA and various other cryptosystems. Despite progress in algorithms for these tasks, important open questions remain about the complexity and practicality of these methods. Moreover, improved algorithms (especially for discrete logarithms) represent a risk for systems built on these hard problems. This research presents a new approach combining earlier sieve methods; the approach yields improvement on some data sets and offers directions for further work in this field. • Self-reported
VQAv2
# AIME (competition-level mathematics)
## Description of the tasks
The American Invitational Mathematics Examination (AIME) is a challenging 15-question mathematics competition for invited high-school students. Each question has an answer in the form of an integer from 0 to 999. These tasks go beyond standard problems and require creative, non-routine approaches to solving.
## Evaluation method
Each LLM is evaluated on 10 AIME tasks and asked to provide an answer and a full solution. For each task, two criteria are evaluated:
1. **Final answer**: does the value match the correct answer?
2. **Correctness of the solution**: is the reasoning sound, and does it lead to the correct answer?
For tasks requiring worked solutions, the following evaluation approach is used:
- The model is asked to provide a solution and a final answer.
- The model must follow a correct procedure to earn points.
- The model's own assessment of its solution's correctness is not relied upon.
## Task set
The 10 AIME tasks:
- cover diverse fields of mathematics (number theory, geometry, combinatorics, algebra)
- require varied approaches and creative thinking
- exemplify real competition mathematics
Tasks are drawn from past AIME competitions to represent diverse mathematical fields and difficulty levels. • Self-reported
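The answer-matching step of an AIME-style evaluation (answers are integers 0 to 999) can be sketched as follows. The extraction heuristic here is an assumption for illustration; the harness's actual answer-parsing logic is not specified in this entry:

```python
import re

def extract_aime_answer(solution_text: str):
    """Pull the final integer answer (0-999) from a model's solution text.

    Heuristic: take the last standalone 1-3 digit integer in the text;
    return None if no in-range candidate is found.
    """
    candidates = [int(m) for m in re.findall(r"\b\d{1,3}\b", solution_text)]
    in_range = [c for c in candidates if 0 <= c <= 999]
    return in_range[-1] if in_range else None

def grade(solution_text: str, correct: int) -> bool:
    """Exact-match grading: the extracted answer must equal the key."""
    return extract_aime_answer(solution_text) == correct

print(grade("The sum telescopes, so the answer is 204.", 204))  # True
```

Judging the second criterion, solution correctness, cannot be automated this way and typically needs a human or model grader.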
License & Metadata
License
llama3_2
Announcement Date
September 25, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.2 11B Instruct
Meta
Multimodal · 10.6B
Best score: 0.7 (MMLU)
Released: Sep 2024
Price: $0.18/1M tokens
Llama 4 Scout
Meta
Multimodal · 109.0B
Best score: 0.8 (MMLU)
Released: Apr 2025
Price: $0.18/1M tokens
Gemma 3 27B
Multimodal · 27.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.11/1M tokens
Gemma 3 12B
Multimodal · 12.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.05/1M tokens
GPT OSS 20B
OpenAI
Multimodal · 20.0B
Best score: 0.9 (MMLU)
Released: Aug 2025
Price: $0.10/1M tokens
Mistral Small 3.2 24B Instruct
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (HumanEval)
Released: Jun 2025
Magistral Medium
Mistral AI
Multimodal · 24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Mistral Small 3 24B Base
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (ARC)
Released: Jan 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.