
Mistral Small 3.1 24B Instruct

Multimodal
Mistral AI

Building on Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art image understanding and improves long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and visual tasks.

Key Specifications

Parameters
24.0B
Context
128K tokens
Release Date
March 17, 2025
Average Score
64.0%

Timeline

Key dates in the model's history
Announcement
March 17, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
24.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Attention sink: a KV-cache technique that improves language-model performance by retaining the keys (K) and values (V) of the earliest tokens in the sequence, which then remain in the model's computations as "sinks" for attention. The method is useful for long-context inference and for retaining information, especially from the start of the sequence, and it helps with coding and instruction-following abilities. Depending on which tokens are retained, the cache can even preserve the prompt's instructions to the model. In deployed systems, this method can reduce computational cost by allowing older tokens to be evicted. Self-reported
80.6%
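The sink-plus-eviction policy mentioned above can be sketched as a function that, given the current sequence length, returns which KV-cache positions are kept: a few initial "sink" tokens plus a sliding window over the most recent tokens. The function name and default parameters are illustrative, not from any particular library.

```python
def kept_kv_positions(seq_len: int, num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Return the KV-cache positions retained under a sink + sliding-window policy.

    The first `num_sinks` tokens are always kept as attention sinks; the rest of
    the budget is a sliding window over the most recent tokens. Everything in
    between is evicted, which bounds memory and compute for long contexts.
    """
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                       # everything still fits
    sinks = list(range(num_sinks))                        # initial "sink" tokens
    recent = list(range(seq_len - window, seq_len))       # sliding window of recent tokens
    return sinks + recent
```

Note that the cache size stays fixed at `num_sinks + window` once the sequence outgrows it, regardless of total length.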

Programming

Programming skills tests
HumanEval
Standard: the traditional setting in which the model receives a prompt and generates a completion. The prompt is laid out as: [Query][Demonstration (if any)][Question], followed by the model's answer. Self-reported
88.4%
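The prompt layout described above can be sketched as a small formatter (a hypothetical helper, not part of any evaluation harness):

```python
def build_prompt(query: str, demonstrations: list[str], question: str) -> str:
    """Assemble a standard few-shot prompt: [Query][Demonstrations (if any)][Question].

    Demonstrations are optional worked examples; with an empty list this
    degenerates to a zero-shot prompt of just query + question.
    """
    parts = [query]
    parts.extend(demonstrations)   # zero or more worked examples
    parts.append(question)
    return "\n\n".join(parts)
```

The model's answer is then whatever completion follows this prompt.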
MBPP
Standard. Self-reported
74.7%

Mathematics

Mathematical problems and computations
MATH
In the standard mode we measure the proportion of tasks for which the correct answer was the model's most likely output. As with greedy decoding, this mode measures whether the model can solve a task when it commits to its most likely token at each step of its reasoning. The standard mode uses no auxiliary techniques, such as sampling or a thinking mode, which makes it possible to measure the model's ability to perform computational tasks under constrained conditions. Self-reported
69.3%
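The scoring rule described above (one greedy attempt per problem, scored as the fraction answered correctly) can be sketched as follows; the string-comparison check is a simplification of real MATH answer matching.

```python
def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """pass@1 under greedy decoding: one deterministic attempt per problem,
    scored as the fraction of problems whose single answer matches the reference.
    """
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```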

Reasoning

Logical reasoning and analysis
GPQA
Diamond, 5-shot CoT. Diamond is a method for logical-inference tasks that takes advantage of deliberate, multi-step reasoning. It works by generating and evaluating several reasoning chains per answer. Diamond begins by generating five different Chain-of-Thought (CoT) chains for a question, each of which yields an answer. It then asks the LLM to evaluate the correctness of each chain and rank them by its own rating. Finally, the LLM answers by relying on the majority of the reasoning chains. The name "Diamond" reflects the shape of the process: it starts from a single query, fans out into a set of reasonings, narrows through evaluation, and finally converges to one answer. Diamond shows improved performance compared with baseline methods on several logical-inference benchmarks, both from generating several chains and from efficiency gains. Self-reported
46.0%
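The final aggregation step described above, where the answer backed by the most reasoning chains wins, can be sketched as a plain majority vote. This is a simplification: the full procedure also has the LLM grade and rank each chain before voting.

```python
from collections import Counter

def majority_answer(chain_answers: list[str]) -> str:
    """Pick the answer supported by the most reasoning chains.

    Ties are broken in favor of the answer that appeared first, since
    Counter.most_common is stable with respect to insertion order.
    """
    counts = Counter(chain_answers)
    return counts.most_common(1)[0][0]
```

With five chains, a single flawed chain cannot override three or more agreeing ones.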

Multimodal

Working with images and visual data
MMMU
CoT accuracy. Evaluating the accuracy of chain-of-thought (CoT) reasoning in mathematics assesses whether the model correctly performs all the steps of a given task, not just whether it gives the correct final answer. CoT accuracy demonstrates that the model not only reaches correct answers but also reasons correctly in the process of obtaining them. For CoT accuracy we want to determine: "is the solution to the task itself correct?". This is a hard question, since there are many correct ways to solve a task. We therefore apply a proxy: correct solutions give correct answers, and incorrect solutions give incorrect answers. (A model can occasionally reach a correct answer from a flawed solution, or slip at the final step of an otherwise correct approach, but in practice the proxy works well.) Using it, our method for evaluating CoT accuracy is: we first verify whether the final answer is correct; if it is, we then grade the chain of reasoning. Self-reported
59.3%
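The two-stage check described above can be sketched as follows. The grader callables are illustrative placeholders: the answer check is the cheap proxy, and the chain grade only runs on samples whose answer already passed.

```python
def cot_accuracy(samples, answer_correct, chain_correct) -> float:
    """Fraction of samples whose final answer is correct AND whose chain passes.

    answer_correct(sample) -> bool: cheap final-answer check (the proxy).
    chain_correct(sample)  -> bool: step-by-step grading; thanks to Python's
    short-circuit `and`, it only runs when the final answer is already right.
    """
    if not samples:
        return 0.0
    passed = sum(1 for s in samples if answer_correct(s) and chain_correct(s))
    return passed / len(samples)
```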

Other Tests

Specialized benchmarks
MMLU-Pro
5-shot CoT. Self-reported
66.8%
SimpleQA
TotalAcc (correct indicator). TotalAcc measures how well the model answers: whether it selects the correct answer from a set of options on multiple-choice questions, or produces the correct answer on open-ended questions. Using this score, we compute the proportion of assignments in the dataset that the model solved correctly. A high TotalAcc score indicates the model's ability to give exact answers to diverse questions. Self-reported
10.4%
TriviaQA
5-shot. Self-reported
80.5%

License & Metadata

License
Apache 2.0
Announcement Date
March 17, 2025
Last Updated
July 19, 2025
