Mistral Small 3 24B Base
Multimodal
Mistral Small 3 is competitive with larger models such as Llama 3.3 70B or Qwen 32B and is an excellent open alternative to closed proprietary models like GPT-4o-mini. It matches the quality of Llama 3.3 70B Instruct while running more than 3x faster on the same hardware.
Key Specifications
Parameters
23.6B
Context
-
Release Date
January 30, 2025
Average Score
67.0%
Timeline
Key dates in the model's history
Announcement
January 30, 2025
Last Update
July 19, 2025
Today
March 26, 2026
Technical Specifications
Parameters
23.6B
Training Tokens
-
Knowledge Cutoff
October 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot • Self-reported
Programming
Programming skills tests
MBPP
Pass@1 metric. Pass@1 measures how often the model solves a task on its first attempt. For each task k, the model is asked for a single answer Ak, which is then evaluated as correct (1) or incorrect (0). Pass@1 = (tasks solved on the first attempt) / (total number of tasks). This metric matters because it shows how far a user can trust the model's first answer, without multiple attempts or verification. A high Pass@1 score indicates a model that solves tasks on the first try, which is critical for many real-world applications. • Self-reported
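The Pass@1 computation described above can be sketched in a few lines. This is a minimal illustration; the function name and the one-boolean-per-task representation are assumptions, not part of any benchmark harness:

```python
def pass_at_1(results):
    """results: one boolean per task; True if the model's single
    (first) answer was judged correct."""
    return sum(results) / len(results)

# 7 of 10 tasks solved on the first attempt
print(pass_at_1([True] * 7 + [False] * 3))  # 0.7
```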
Mathematics
Mathematical problems and computations
GSM8k
5-shot, maj@1. For each task the model is queried 5 times and the majority answer (the most frequent among the samples) is taken as the final answer. This can help smooth over variance introduced by the token-sampling process, at the cost of 5 model queries per task. • Self-reported
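Majority voting over repeated samples, as used in maj@k scoring, can be sketched as follows (illustrative only; the helper name is an assumption):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among k sampled generations.
    Ties are broken by first occurrence, per Counter.most_common."""
    return Counter(answers).most_common(1)[0][0]

# 5 samples for one GSM8k problem; "42" wins 3-1-1
print(majority_vote(["42", "41", "42", "40", "42"]))  # 42
```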
MATH
5-shot, maj • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
5-shot, CoT. In this method the model is first given 5 worked examples solving different tasks with a chain-of-thought approach. This lets the model see explicit reasoning steps before it solves the new task. The method combines few-shot learning with chain-of-thought reasoning to strengthen the model's ability to solve complex problems: from several reasoning examples, the model can identify solution patterns and apply them to the new task. 5-shot CoT is especially effective for tasks requiring multi-step reasoning, such as math problems, logic puzzles, or tasks requiring analysis. The in-context examples show the model how to structure its thinking and break complex tasks into manageable steps. • Self-reported
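Assembling a few-shot chain-of-thought prompt amounts to concatenating worked examples before the new question. A minimal sketch (the Q/A format and function name are assumptions; real evaluation harnesses use benchmark-specific templates):

```python
def few_shot_cot_prompt(examples, question):
    """examples: (question, reasoning, answer) triples shown to the
    model before the new question; the reasoning is spelled out so
    the model imitates step-by-step solving."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

demo = [("What is 2 + 3?", "2 plus 3 equals 5.", "5")]
print(few_shot_cot_prompt(demo, "What is 4 + 7?"))
```

With 5 such triples instead of 1, this produces exactly the 5-shot CoT setting described above.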
Other Tests
Specialized benchmarks
AGIEval
Step-by-step text comprehension: the process of reading a text and extracting the information needed to perform a task. For example, when solving a math word problem, one first needs to read the problem, then identify the given data and the goal, i.e. what must be found. An example of the step-by-step process: 1. Read the full text to get a general understanding. 2. Identify the components: what exactly is given (all key quantities and their values) and what exactly is required to find. 3. Extract the information into a structured format, noting the relations between the various quantities. 4. Verify that all important components are captured and no information is missing. This approach is especially useful for tasks requiring information extraction, such as math or science problems, since it helps break free-form text into manageable steps. • Self-reported
ARC-C
0-shot: a setting in which the model is given a task without any examples or additional context. Such tasks test the model's base knowledge and its ability to follow instructions. On tasks requiring specialized knowledge, smaller or older models can struggle in 0-shot mode, while newer models trained on more diverse and specialized data can often succeed even without additional prompting. Example: a prompt like "numbers from 1 to 100" with no additional instructions. • Self-reported
MMLU-Pro
0-shot CoT. In this approach, based on the chain-of-thought method, we directly ask the model to solve a task while laying out its reasoning in stages, without providing examples of such a process. Prompts such as "Let's solve this task step by step" or "Let's think step by step" are typically used, prompting the model to generate intermediate reasoning before the final answer. This method is effective for models capable enough to build reasoning chains on their own. It lets the model structure its thinking without needing in-context training examples, which makes the approach more general and less dependent on specific examples. • Self-reported
TriviaQA
5-shot. We provide several examples as demonstrations; using k examples is usually called k-shot prompting. In this case, 5-shot means giving the model 5 example demonstrations before it generates its output. Increasing the number of examples often improves performance by giving the model more information about the task and the expected answer format. However, returns diminish as the number of examples grows, and too many examples can even hurt performance due to context-length limitations or model confusion. • Self-reported
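The k-shot setup described above (here k = 5) reduces to prepending k question–answer demonstrations to the query. A minimal sketch; the function name and Q/A layout are illustrative assumptions:

```python
def k_shot_prompt(demos, query, k=5):
    """demos: (question, answer) pairs; the first k are prepended
    as demonstrations before the new query."""
    lines = [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

print(k_shot_prompt([("2+2?", "4"), ("3+3?", "6")], "5+5?"))
```

Unlike the CoT variant, the demonstrations here contain only final answers, not reasoning traces.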
License & Metadata
License
Apache 2.0
Announcement Date
January 30, 2025
Last Updated
July 19, 2025
Similar Models
Mistral Small 3.2 24B Instruct
Mistral AI
Multimodal • 23.6B
Best score: 0.9 (HumanEval)
Released: Jun 2025
Pixtral-12B
Mistral AI
Multimodal • 12.4B
Best score: 0.7 (HumanEval)
Released: Sep 2024
Price: $0.15/1M tokens
Mistral Small 3.1 24B Instruct
Mistral AI
Multimodal • 24.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Magistral Medium
Mistral AI
Multimodal • 24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Mistral Small 3.1 24B Base
Mistral AI
Multimodal • 24.0B
Best score: 0.8 (MMLU)
Released: Mar 2025
Price: $0.10/1M tokens
Mistral Small 3 24B Instruct
Mistral AI
24.0B
Best score: 0.8 (HumanEval)
Released: Jan 2025
Price: $0.10/1M tokens
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.