Pixtral-12B
Multimodal model with 12 billion parameters and a 400-million-parameter vision encoder, capable of understanding both natural images and documents. It excels at multimodal tasks while maintaining high-quality text-only performance. Supports images of various sizes and multiple images in context.
Key Specifications
Parameters
12.4B
Context
128.0K
Release Date
September 17, 2024
Average Score
66.8%
Timeline
Key dates in the model's history
Announcement
September 17, 2024
Last Update
July 19, 2025
Today
March 26, 2026
Technical Specifications
Parameters
12.4B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.15
Max Input Tokens
128.0K
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
Pass@1 Metric: Pass@1 evaluates how many problems from a set the model can solve on its first attempt (with a single generated solution). For each task, only the model's first solution counts. The Pass@1 value shows what percentage of tasks the model solves directly, without the ability to correct its answer or make several attempts. This is a strict metric, since it does not let the model iterate on or refine its output. A high Pass@1 value indicates the model's ability to give correct answers immediately, which is especially important in scenarios where users expect fast and accurate results without follow-up queries • Self-reported
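To make the definition concrete, below is a minimal sketch of how a Pass@1 score can be computed when each task receives exactly one generated solution. It is illustrative only; the function and the `solves` checker are hypothetical stand-ins, not the official HumanEval harness.

```python
# Minimal Pass@1 sketch: one generated solution per task.
# `solves` is a hypothetical checker, not the official HumanEval harness.
from typing import Callable, Sequence


def pass_at_1(tasks: Sequence[str], solves: Callable[[str], bool]) -> float:
    """Fraction of tasks whose single first-attempt solution is correct."""
    if not tasks:
        return 0.0
    return sum(1 for task in tasks if solves(task)) / len(tasks)


# Example: 3 of 4 tasks solved on the first attempt -> Pass@1 = 0.75
results = {"t1": True, "t2": True, "t3": False, "t4": True}
print(pass_at_1(list(results), lambda t: results[t]))  # 0.75
```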
Mathematics
Mathematical problems and computations
MATH
Pass@1: In model evaluation, especially for problem-solving tasks, Pass@1 is the percentage of tasks the model solves correctly on its first attempt. It is a strict measure of performance that does not allow the model several solution attempts or a chance to revise its answer. If the model produces one solution for each of N tasks and k of those solutions are correct, then Pass@1 = k/N. In the context of coding or mathematical tasks, where correctness is binary (a solution is either correct or not), Pass@1 provides an unambiguous score that does not reward retries or self-correction. Unlike Pass@k-style metrics, Pass@1 measures the model's baseline reliability when performing tasks without the ability to verify or revise its output • Self-reported
Multimodal
Working with images and visual data
ChartQA
Chain of Thought (CoT): The model works through the problem step by step. First it analyzes the task to understand what is required, then solves it, explaining its reasoning at each stage. For mathematical tasks it breaks the problem into components, finds the relevant relationships, and works toward a solution. For reasoning tasks it lays out its logic, considers different interpretations, and accounts for all aspects of the problem. Chain-of-thought reasoning helps catch errors, organize the model's thinking, and lead to correct answers: by recording each step, the model can track its own process and detect mistakes or incorrect assumptions. This method is especially useful for complex tasks requiring multi-step reasoning • Self-reported
DocVQA
ANLS (Average Normalized Levenshtein Similarity): a metric used to evaluate the quality of answers in VQA (Visual Question Answering) tasks such as DocVQA. It uses a normalized Levenshtein similarity (NLS) function, which is better suited to scoring free-form answers than exact matching. ANLS measures the similarity between the predicted answer and the reference answer, tolerating small differences that do not affect correctness (for example, "1990" versus "1990 year"). This makes it a more forgiving, user-oriented metric for evaluating question-answering systems. Values range from 0 to 1, where values close to 1 indicate a closer match between the predicted and reference answers • Self-reported
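The sketch below illustrates the idea with a plain edit-distance implementation; it is a simplified version assuming a single reference answer per question and the commonly used 0.5 threshold, and is not the official DocVQA scorer.

```python
# Simplified ANLS sketch (single reference answer per question; the official
# DocVQA scorer handles multiple ground-truth answers and extra normalization).
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def anls(predictions: list[str], references: list[str], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity; similarities below tau are zeroed."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)
        scores.append(sim if sim >= tau else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["1990"], ["1990"]))  # 1.0: exact match
print(anls(["199o"], ["1990"]))  # 0.75: one character differs, still above the threshold
```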
MathVista
Chain of Thought (CoT): Chain-of-thought prompting is a technique that asks the model to solve a task step by step, explicitly showing the intermediate steps of its reasoning. Instead of answering immediately, the model produces a sequence of logical steps leading to the result. This is especially useful for tasks that require several reasoning steps, such as mathematical problems, logic puzzles, and assignments requiring analysis. Research shows that prompts like "let's think step by step" or similar instructions can significantly improve model performance without any additional fine-tuning. CoT is especially effective on complex tasks and can be combined with other methods such as Self-Consistency, where the model generates several reasoning chains and takes the most frequent result • Self-reported
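As a hedged illustration of the prompting pattern (the question and wording below are made up for this example, not taken from MathVista), chain-of-thought prompting only changes how the question is posed:

```python
# Illustrative chain-of-thought prompt construction; the question is a made-up
# example and no specific model API is assumed here.
question = (
    "A chart shows sales of 120, 150, and 180 units over three months. "
    "By what percentage did sales grow from the first month to the last?"
)

direct_prompt = question
cot_prompt = question + "\n\nLet's think step by step before giving the final answer."

# With the CoT prompt the model is expected to spell out intermediate steps, e.g.:
#   1. First month: 120 units; last month: 180 units.
#   2. Increase: 180 - 120 = 60 units.
#   3. Percentage growth: 60 / 120 = 0.5, i.e. 50%.
# and only then state the final answer (50%).
```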
MMMU
Chain of Thought (CoT) • Self-reported
Other Tests
Specialized benchmarks
IFEval
Text Instruction Following Score: To evaluate the model's ability to follow instructions, we measure how well the model adheres to the specific instructions given in the prompt when producing its answer. These tasks combine constrained output with general knowledge. For example, the model may be given three items and asked to use them in its answer, and also to explain why the answer contains only those three. Tasks are scored on two criteria: 1. Accuracy: the factual information is correct. 2. Format: the answer follows the exact format requested. This score is based on the approach used in MT-Bench • Self-reported
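Many instruction-following constraints of this kind are mechanically verifiable. The snippet below is a minimal sketch of one hypothetical format check (exactly three bullet points); the actual benchmark uses a much larger set of rules and also scores factual accuracy separately.

```python
# Hypothetical verifiable-format check in the spirit of IFEval; the real
# benchmark covers many rule types and pairs format checks with accuracy scoring.
def has_exactly_n_bullets(answer: str, n: int = 3) -> bool:
    """True if the answer contains exactly n lines starting with '- '."""
    bullets = [line for line in answer.splitlines() if line.lstrip().startswith("- ")]
    return len(bullets) == n


answer = (
    "- Paris is the capital of France.\n"
    "- It lies on the Seine.\n"
    "- Its population is roughly two million."
)
print(has_exactly_n_bullets(answer))  # True: exactly three bullet points
```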
MM IF-Eval
Multimodal Instruction Following Evaluation: This evaluation measures how well the model understands and follows complex instructions that include both text and images. We evaluate the model on its ability to interpret images, follow instructions that reference visual information, and reason on the basis of visual data. Example tasks ask the model to describe what is in an image and then suggest possible applications of the depicted object, to answer conditionally depending on what appears in the image, or to find and correct errors in mathematical content shown in an image. Scoring methodology: each task is scored on a scale from 0 to 5, based on the accuracy of instruction execution and the quality of the reasoning applied. This metric is especially important for models used in tasks where visual context can be critical for a correct answer • Self-reported
MM-MT-Bench
Multimodal MT-Bench Score • Self-reported
MT-Bench
Text MT-Bench Score: The MT-Bench evaluation measures the quality of the model's language abilities across a set of tasks designed to test various aspects of language understanding and generation. A high MT-Bench score indicates that the model handles tasks such as reasoning, generalization, and question answering well, meaning it demonstrates solid language understanding and can generate accurate and coherent answers. MT-Bench scores can be interpreted as follows: • Above 8.0: excellent performance, on par with leading AI models • 7.0-8.0: strong performance with good language understanding • 6.0-7.0: decent performance with some limitations • 5.0-6.0: moderate performance with noticeable weaknesses • Below 5.0: weak performance that may not cope with complex tasks. Comparing MT-Bench scores across models can help choose the most suitable model for a specific use case, especially when performance on particular language tasks matters • Self-reported
VQAv2
VQA Match: a metric that measures the quality of AI models on visual question answering (VQA) tasks. Method: unlike tasks with a fixed answer set or yes/no questions, the VQA Match metric is applied to open-ended answers to questions about images. The metric yields a value from 0 to 1 reflecting the degree of agreement between the model's answer and the reference answer. Evaluation process: 1. For a given predicted answer a and reference answer â, a similarity function sim(a, â) compares the answers to produce a score. 2. Three levels of matching are used: • Exact match: a score of 1 if the answers are identical • Partial match: partial similarity between the answers • Semantic match: used for free-form answers, based on natural-language-processing algorithms. Advantages: works with various answer types, including numerical ones, and scales to large datasets and diverse application domains. Application: this metric is used for evaluation in visual question-answering tasks, allowing comparison of different machine-learning models that work with images and text • Self-reported
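As a rough illustration of the exact/partial matching idea described above, here is a simplified scorer. It is an assumption-laden sketch, not the official VQAv2 protocol, which aggregates agreement across multiple human answers and applies its own answer-normalization rules.

```python
# Simplified VQA answer-matching sketch (illustrative only; the official VQAv2
# metric aggregates agreement across several human answers and normalizes text).
def vqa_match(prediction: str, reference: str) -> float:
    """1.0 for an exact match after lowercasing; otherwise a crude
    token-overlap score in [0, 1] standing in for partial matching."""
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    if pred == ref:
        return 1.0
    pred_tokens, ref_tokens = set(pred.split()), set(ref.split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)


print(vqa_match("two dogs", "two dogs"))   # 1.0: exact match
print(vqa_match("a red car", "red car"))   # ~0.67: partial token overlap
```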
License & Metadata
License
Apache 2.0
Announcement Date
September 17, 2024
Last Updated
July 19, 2025
Similar Models
Mistral Small 3.2 24B Instruct
Mistral AI
MM · 23.6B
Best score: 0.9 (HumanEval)
Released: Jun 2025
Mistral Small 3 24B Base
Mistral AI
MM · 23.6B
Best score: 0.9 (ARC)
Released: Jan 2025
Mistral Small 3.1 24B Instruct
Mistral AI
MM · 24.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Magistral Medium
Mistral AI
MM · 24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Mistral Small 3.1 24B Base
Mistral AI
MM · 24.0B
Best score: 0.8 (MMLU)
Released: Mar 2025
Price: $0.10/1M tokens
Mistral Small 3 24B Instruct
Mistral AI
24.0B
Best score: 0.8 (HumanEval)
Released: Jan 2025
Price: $0.10/1M tokens
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Released:Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.