
Mistral Small 3.1 24B Instruct

Multimodal
Mistral AI

Building on Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art image understanding and improves long-context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and visual tasks.

Key Specifications

Parameters
24.0B
Context
128K tokens
Release Date
March 17, 2025
Average Score
64.0%

Timeline

Key dates in the model's history
Announcement
March 17, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
24.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Attention sink: a KV-cache technique that improves language-model performance by retaining the keys (K) and values (V) of the earliest tokens in the sequence, which then remain in the model's computations as "sinks" for attention. The method is useful for long-context inference and for retaining information, especially from the start of the sequence, and it helps with coding and instruction-following abilities. Depending on which tokens are retained, the cache can even preserve the prompt's instructions to the model. In deployed systems, this method can reduce computational cost by allowing older tokens to be evicted. Self-reported
80.6%
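The sink-plus-eviction policy mentioned above can be sketched as a function that, given the current sequence length, returns which KV-cache positions are kept: a few initial "sink" tokens plus a sliding window over the most recent tokens. The function name and default parameters are illustrative, not from any particular library.

```python
def kept_kv_positions(seq_len: int, num_sinks: int = 4, window: int = 1024) -> list[int]:
    """Return the KV-cache positions retained under a sink + sliding-window policy.

    The first `num_sinks` tokens are always kept as attention sinks; the rest of
    the budget is a sliding window over the most recent tokens. Everything in
    between is evicted, which bounds memory and compute for long contexts.
    """
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                       # everything still fits
    sinks = list(range(num_sinks))                        # initial "sink" tokens
    recent = list(range(seq_len - window, seq_len))       # sliding window of recent tokens
    return sinks + recent
```

Note that the cache size stays fixed at `num_sinks + window` once the sequence outgrows it, regardless of total length.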

Programming

Programming skills tests
HumanEval
Standard: the traditional setting in which the model receives a prompt and generates a completion. The prompt is laid out as: [Query][Demonstration (if any)][Question], followed by the model's answer. Self-reported
88.4%
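The prompt layout described above can be sketched as a small formatter (a hypothetical helper, not part of any evaluation harness):

```python
def build_prompt(query: str, demonstrations: list[str], question: str) -> str:
    """Assemble a standard few-shot prompt: [Query][Demonstrations (if any)][Question].

    Demonstrations are optional worked examples; with an empty list this
    degenerates to a zero-shot prompt of just query + question.
    """
    parts = [query]
    parts.extend(demonstrations)   # zero or more worked examples
    parts.append(question)
    return "\n\n".join(parts)
```

The model's answer is then whatever completion follows this prompt.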
MBPP
Standard. Self-reported
74.7%

Mathematics

Mathematical problems and computations
MATH
In the standard mode we measure the proportion of tasks for which the correct answer was the model's most likely output. As with greedy decoding, this mode measures whether the model can solve a task when it commits to its most likely token at each step of its reasoning. The standard mode uses no auxiliary techniques, such as sampling or a thinking mode, which makes it possible to measure the model's ability to perform computational tasks under constrained conditions. Self-reported
69.3%
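The scoring rule described above (one greedy attempt per problem, scored as the fraction answered correctly) can be sketched as follows; the string-comparison check is a simplification of real MATH answer matching.

```python
def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """pass@1 under greedy decoding: one deterministic attempt per problem,
    scored as the fraction of problems whose single answer matches the reference.
    """
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```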

Reasoning

Logical reasoning and analysis
GPQA
Diamond, 5-shot CoT. Diamond is a method for logical-inference tasks that takes advantage of deliberate, multi-step reasoning. It works by generating and evaluating several reasoning chains per answer. Diamond begins by generating five different Chain-of-Thought (CoT) chains for a question, each of which yields an answer. It then asks the LLM to evaluate the correctness of each chain and rank them by its own rating. Finally, the LLM answers by relying on the majority of the reasoning chains. The name "Diamond" reflects the shape of the process: it starts from a single query, fans out into a set of reasonings, narrows through evaluation, and finally converges to one answer. Diamond shows improved performance compared with baseline methods on several logical-inference benchmarks, both from generating several chains and from efficiency gains. Self-reported
46.0%
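The final aggregation step described above, where the answer backed by the most reasoning chains wins, can be sketched as a plain majority vote. This is a simplification: the full procedure also has the LLM grade and rank each chain before voting.

```python
from collections import Counter

def majority_answer(chain_answers: list[str]) -> str:
    """Pick the answer supported by the most reasoning chains.

    Ties are broken in favor of the answer that appeared first, since
    Counter.most_common is stable with respect to insertion order.
    """
    counts = Counter(chain_answers)
    return counts.most_common(1)[0][0]
```

With five chains, a single flawed chain cannot override three or more agreeing ones.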

Multimodal

Working with images and visual data
MMMU
CoT accuracy. Evaluating the accuracy of chain-of-thought (CoT) reasoning in mathematics assesses whether the model correctly performs all the steps of a given task, not just whether it gives the correct final answer. CoT accuracy demonstrates that the model not only reaches correct answers but also reasons correctly in the process of obtaining them. For CoT accuracy we want to determine: "is the solution to the task itself correct?". This is a hard question, since there are many correct ways to solve a task. We therefore apply a proxy: correct solutions give correct answers, and incorrect solutions give incorrect answers. (A model can occasionally reach a correct answer from a flawed solution, or slip at the final step of an otherwise correct approach, but in practice the proxy works well.) Using it, our method for evaluating CoT accuracy is: we first verify whether the final answer is correct; if it is, we then grade the chain of reasoning. Self-reported
59.3%
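The two-stage check described above can be sketched as follows. The grader callables are illustrative placeholders: the answer check is the cheap proxy, and the chain grade only runs on samples whose answer already passed.

```python
def cot_accuracy(samples, answer_correct, chain_correct) -> float:
    """Fraction of samples whose final answer is correct AND whose chain passes.

    answer_correct(sample) -> bool: cheap final-answer check (the proxy).
    chain_correct(sample)  -> bool: step-by-step grading; thanks to Python's
    short-circuit `and`, it only runs when the final answer is already right.
    """
    if not samples:
        return 0.0
    passed = sum(1 for s in samples if answer_correct(s) and chain_correct(s))
    return passed / len(samples)
```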

Other Tests

Specialized benchmarks
MMLU-Pro
5-shot CoT. Self-reported
66.8%
SimpleQA
TotalAcc (correct indicator). TotalAcc measures how well the model answers: whether it selects the correct answer from a set of options on multiple-choice questions, or produces the correct answer on open-ended questions. Using this score, we compute the proportion of assignments in the dataset that the model solved correctly. A high TotalAcc score indicates the model's ability to give exact answers to diverse questions. Self-reported
10.4%
TriviaQA
5-shot. Self-reported
80.5%

License & Metadata

License
Apache 2.0
Announcement Date
March 17, 2025
Last Updated
July 19, 2025
