Mistral Small 3.1 24B Instruct
Multimodal
Building on Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art image understanding and improves long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and visual tasks.
Key Specifications
Parameters
24.0B
Context
128k tokens
Release Date
March 17, 2025
Average Score
64.0%
Timeline
Key dates in the model's history
Announcement
March 17, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
24.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Attention sink — a method for improving language-model performance by placing tokens before the query whose keys (K) and values (V) take part in the model's computations. These tokens act as "sinks" for attention. The method is useful for long-context work and for retrieving information, and it can help improve coding abilities. Depending on the tokens chosen, they can even serve as instructions or prompts for the model. In streaming settings, this method can help reduce computational cost. • Self-reported
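As a rough illustration of the attention-sink idea, below is a minimal NumPy sketch of sliding-window attention in which the leading "sink" tokens stay visible to every query. The window size and sink count are illustrative assumptions, not values reported for this model.

```python
import numpy as np

def sink_attention_mask(seq_len: int, window: int, num_sinks: int = 4) -> np.ndarray:
    """Boolean mask: entry [i, j] is True where query i may attend to key j.

    Sliding-window causal attention, except that the first `num_sinks`
    tokens (the attention sinks) stay visible to every query, so their
    keys (K) and values (V) always take part in the softmax.
    """
    q = np.arange(seq_len)[:, None]  # query positions (rows)
    k = np.arange(seq_len)[None, :]  # key positions (columns)
    causal = k <= q                  # no attending to the future
    in_window = q - k < window       # recent tokens only...
    is_sink = k < num_sinks          # ...plus the sinks, always
    return causal & (in_window | is_sink)

# 10 tokens, window of 4, 2 sink tokens: columns 0-1 stay on in every row.
print(sink_attention_mask(seq_len=10, window=4, num_sinks=2).astype(int))
```

Because every query keeps the same few sink keys, the KV cache for a long stream stays bounded by roughly `window + num_sinks` entries per layer.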
Programming
Programming skills tests
HumanEval
Standard — the traditional mode in which the model receives input and generates an answer directly. In contrast with other modes, the prompt is laid out as [Query][Demonstrations (if any)][Question], and the model's completion is taken as [Model answer]. • Self-reported
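A minimal sketch of that prompt layout; the helper and its argument names are hypothetical, not part of any benchmark harness:

```python
def build_standard_prompt(query: str, demonstrations: list[str], question: str) -> str:
    """Assemble a standard (non-CoT) prompt: [Query][Demonstrations (if any)][Question]."""
    parts = [query, *demonstrations, question]
    return "\n\n".join(p for p in parts if p)

prompt = build_standard_prompt(
    query="Answer the following programming question.",
    demonstrations=["Q: Reverse a string in Python?\nA: s[::-1]"],
    question="Q: Check whether a number is even?\nA:",
)
print(prompt)  # the model's completion is then taken as [Model answer]
```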
MBPP
Standard • Self-reported
Mathematics
Mathematical problems and computations
MATH
In standard mode we measure the proportion of tasks for which the correct answer was the model's most likely output. As with greedy decoding, this mode measures whether the model can solve a task when it takes the most likely continuation at each step of its reasoning. Standard mode uses no sampled solutions or thinking mode, which makes it possible to measure the model's ability to perform computational tasks under deterministic conditions. • Self-reported
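A minimal sketch of that scoring rule, assuming a `generate` callable that decodes greedily; `Problem` and its checker are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    prompt: str
    check: Callable[[str], bool]  # does this answer solve the task?

def standard_mode_score(generate: Callable[[str], str], problems: list[Problem]) -> float:
    """Proportion of tasks solved when decoding takes the most likely
    continuation at every step (greedy); no sampling, no thinking mode."""
    solved = sum(p.check(generate(p.prompt)) for p in problems)
    return solved / len(problems)

# Toy usage with a stand-in "model" that always answers "4".
problems = [Problem("2 + 2 = ?", lambda a: a.strip() == "4")]
print(standard_mode_score(lambda prompt: "4", problems))  # 1.0
```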
Reasoning
Logical reasoning and analysis
GPQA
Diamond, 5-shot CoT. Diamond is a method for logical-inference tasks that draws on the advantages of deliberate thinking and multi-step reasoning. It works by generating and evaluating several reasoning chains, each ending in an answer: it first generates five different Chain-of-Thought (CoT) chains for a question, each of which yields an answer, then asks an LLM to judge the correctness of each chain and rank them by its rating, and finally has the LLM answer, relying on the most trustworthy chains. The name "Diamond" reflects the shape of the process: it starts from a query, widens to a set of reasoning chains, narrows through evaluation, and finally converges to an answer. Diamond shows performance improvements over comparable methods on several logical-inference benchmarks, with both the generation of several chains and their evaluation contributing to its effectiveness. • Self-reported
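A minimal sketch of that generate-then-judge shape; `generate_cot` and `judge` stand in for LLM calls and are assumptions, not the published method's API:

```python
from collections import Counter

def diamond_answer(generate_cot, judge, question: str, n_chains: int = 5) -> str:
    """Generate several CoT chains, rate each, answer from the best-rated.

    generate_cot(question) -> (reasoning, answer); judge(question, reasoning)
    -> score in [0, 1]. Both are stand-ins for LLM calls.
    """
    chains = [generate_cot(question) for _ in range(n_chains)]
    votes = Counter()
    for reasoning, answer in chains:
        votes[answer] += judge(question, reasoning)  # weight the vote by its rating
    return votes.most_common(1)[0][0]

# Toy usage with stand-in callables.
answer = diamond_answer(
    generate_cot=lambda q: ("reasoning...", "B"),
    judge=lambda q, r: 1.0,
    question="Which option follows logically?",
)
print(answer)  # "B"
```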
Multimodal
Working with images and visual data
MMMU
CoT accuracy. Chain-of-thought (CoT) accuracy in mathematics evaluates whether the model performs all stages of a specific task correctly, not only whether it gives the correct final answer. CoT accuracy demonstrates that the model not only reaches correct answers but also reasons correctly while obtaining them. For CoT accuracy we want to determine: "is the solution to this task valid?" That is a hard question, since there are many correct ways to solve a task. We therefore apply an approximation: correct solutions give correct answers, and incorrect solutions give incorrect answers. (A model can occasionally reach a correct answer from a flawed solution, or slip at the final step of an otherwise correct approach, but in practice this approximation works well.) Using it, our method for evaluating CoT accuracy is to check the final answer to decide whether the task was solved; if it is correct, we grade the reasoning chain as correct. • Self-reported
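A minimal sketch of that answer-based grading heuristic; the record fields are hypothetical:

```python
def cot_accuracy(records: list[dict]) -> float:
    """Grade each reasoning chain by its final answer, per the approximation
    that correct solutions give correct answers and incorrect solutions give
    incorrect ones (lucky guesses and last-step slips break this, but rarely)."""
    graded = [r["final_answer"] == r["gold_answer"] for r in records]
    return sum(graded) / len(graded)

records = [
    {"final_answer": "12", "gold_answer": "12"},  # chain graded correct
    {"final_answer": "7",  "gold_answer": "9"},   # chain graded incorrect
]
print(cot_accuracy(records))  # 0.5
```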
Other Tests
Specialized benchmarks
MMLU-Pro
5-shot CoT • Self-reported
SimpleQA
TotalAcc, Correct. The TotalAcc indicator measures how well the model answers: whether it picks the correct answer from the given options in multiple-choice questions, or produces the correct answer in open-ended ones. Using this score, we compute the proportion of answers the model got right across all tasks in the dataset. A high TotalAcc score indicates the model's ability to give exact answers to diverse questions. • Self-reported
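A minimal sketch of that computation; the matching rule and data format are hypothetical:

```python
def is_correct(gold: str, model_answer: str) -> bool:
    """Correct if the model picked the right option (multiple choice)
    or matched the reference answer (open ended)."""
    return model_answer.strip().lower() == gold.strip().lower()

def total_acc(golds: list[str], answers: list[str]) -> float:
    """TotalAcc: proportion of questions the model answered correctly."""
    return sum(is_correct(g, a) for g, a in zip(golds, answers)) / len(golds)

print(total_acc(["Paris", "B"], ["paris", "C"]))  # 0.5
```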
TriviaQA
5-shot • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
March 17, 2025
Last Updated
July 19, 2025
Similar Models
Mistral Small 3.2 24B Instruct
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (HumanEval)
Released: Jun 2025
Magistral Medium
Mistral AI
Multimodal · 24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Mistral Small 3 24B Base
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (ARC)
Released: Jan 2025
Pixtral-12B
Mistral AI
Multimodal · 12.4B
Best score: 0.7 (HumanEval)
Released: Sep 2024
Price: $0.15/1M tokens
Mistral Small 3.1 24B Base
Mistral AI
Multimodal · 24.0B
Best score: 0.8 (MMLU)
Released: Mar 2025
Price: $0.10/1M tokens
Mistral Small 3 24B Instruct
Mistral AI
24.0B
Best score: 0.8 (HumanEval)
Released: Jan 2025
Price: $0.10/1M tokens
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Gemma 3 27B
Multimodal · 27.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.11/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.