
Mistral Small 3.2 24B Instruct

Multimodal
Mistral AI

Mistral-Small-3.2-24B-Instruct-2506 is a minor update to the Mistral-Small-3.1-24B-Instruct-2503 model.

Key Specifications

Parameters
23.6B
Context
-
Release Date
June 20, 2025
Average Score
68.2%

Timeline

Key dates in the model's history
Announcement
June 20, 2025
Last Update
August 3, 2025

Technical Specifications

Parameters
23.6B
Training Tokens
-
Knowledge Cutoff
October 1, 2023
Family
-
Fine-tuned from
mistral-small-3.1-24b-base-2503
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
5-shot, Self-reported
80.5%

Mathematics

Mathematical problems and computations
MATH
5-shot, Self-reported. In 5-shot prompting the model is shown five demonstration question-answer pairs from the benchmark's few-shot split before the target question, so it can infer the expected format and approach from the examples alone.
69.4%
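The 5-shot setup described above can be sketched as a simple prompt builder. This is an illustrative assumption about the harness, not Mistral's actual evaluation code; the function name and prompt template are hypothetical.

```python
# Hypothetical sketch of 5-shot prompt assembly: five worked
# question-answer pairs precede the target question.
def build_five_shot_prompt(demos, question):
    """demos: list of (question, answer) pairs; question: target question."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos[:5]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Toy demonstrations; a real harness would draw these from the benchmark.
demos = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 6)]
prompt = build_five_shot_prompt(demos, "What is 6 + 6?")
```

The model's completion after the final `Answer:` is then scored against the reference answer.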

Reasoning

Logical reasoning and analysis
GPQA
5-shot CoT, Self-reported. Each of the five demonstrations includes intermediate reasoning steps ("let's think step by step"), not just a question-answer pair, combining few-shot prompting with Chain-of-Thought. This helps the model structure multi-step reasoning on new tasks without any additional fine-tuning.
44.2%
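The difference from plain few-shot prompting is that each demonstration carries a reasoning trace before its final answer. A minimal sketch, with hypothetical names and template, assuming a triple of (question, reasoning, answer) per demonstration:

```python
# Hypothetical sketch of a 5-shot Chain-of-Thought prompt: each
# demonstration shows intermediate reasoning before the final answer.
def build_cot_prompt(demos, question):
    """demos: list of (question, reasoning, answer) triples."""
    blocks = [
        f"Question: {q}\nLet's think step by step. {r}\nFinal answer: {a}"
        for q, r, a in demos[:5]
    ]
    blocks.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(blocks)

# Toy demonstrations; real ones come from the benchmark's few-shot split.
demos = [
    ("What is 2 + 3 * 4?",
     "Multiplication first: 3 * 4 = 12, then 2 + 12 = 14.",
     "14"),
] * 5
prompt = build_cot_prompt(demos, "What is 5 + 2 * 3?")
```

The trailing "Let's think step by step." invites the model to emit its own reasoning chain before committing to an answer.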
GPQA
5-shot CoT, Self-reported
46.1%

Multimodal

Working with images and visual data
AI2D
Self-reported
92.9%
ChartQA
Self-reported
87.4%
DocVQA
Self-reported
94.9%
MathVista
Self-reported
67.1%
MMMU
Self-reported
62.5%

Other Tests

Specialized benchmarks
Arena Hard
Self-reported
43.1%
HumanEval Plus
Pass@5, Self-reported. Pass@5 is the probability that at least one of five sampled solutions passes the tests. It reflects realistic usage better than single-attempt accuracy, since users often ask the model for several candidate solutions and keep the best one.
92.9%
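One common way to compute pass@k is the unbiased estimator used in code-generation evaluations: sample n solutions, count the c that pass, and compute pass@k = 1 - C(n-c, k) / C(n, k). This sketch is not necessarily how the score above was produced.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total samples generated, c: samples that pass the tests,
    k: attempts the metric allows (here k = 5).
    """
    if n - c < k:
        # Too few failures to fill k draws without a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per task, 3 of which pass
score = pass_at_k(10, 3, 5)
```

Averaging this quantity over all tasks in the benchmark gives the reported percentage.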
IF
Self-reported
84.8%
MBPP Plus
Pass@5, Self-reported
78.3%
MMLU-Pro
5-shot CoT, Self-reported
69.1%
SimpleQA
Total accuracy, Self-reported
12.1%
Wild Bench
Self-reported
65.3%

License & Metadata

License
Apache 2.0
Announcement Date
June 20, 2025
Last Updated
August 3, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.