Magistral Medium
Magistral Medium, trained exclusively with reinforcement learning on top of Mistral Medium 3, is a reasoning model that performs strongly on complex math and coding problems without distillation from existing reasoning models. Training uses an RLVR (reinforcement learning with verifiable rewards) framework with modifications to GRPO, improving reasoning ability and multilingual consistency.
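For context on the GRPO mention above: GRPO dispenses with a learned value function and instead normalizes each sampled completion's reward against the other completions generated for the same prompt. A minimal sketch of that group-relative advantage computation (simplified and illustrative, not Mistral's actual training code):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize each completion's reward
    against the mean/std of its own group (one prompt, G sampled completions)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # epsilon avoids /0

# Example: 4 completions for one prompt, scored by a verifiable reward
# (e.g., 1.0 if the final math answer checks out, 0.0 otherwise).
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))  # positive for correct samples, negative otherwise
```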
Key Specifications
Parameters
24.0B
Context
-
Release Date
June 10, 2025
Average Score
52.6%
Timeline
Key dates in the model's history
Announcement
June 10, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
24.0B
Training Tokens
-
Knowledge Cutoff
June 1, 2025
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA
Score reported with a description of a "diamond" (diverge-then-converge) reasoning pattern that combines step-by-step reasoning with self-consistency: the model first drafts a solution, then produces several independent solutions, and finally reconciles them into one answer.
How the pattern is applied:
1. Initial solution: solve the task with step-by-step thinking, internally if possible.
2. Diversification: generate several independent solutions, for example with different solution methods and verification strategies.
3. Comparison: compare all solutions, check whether the answers and conclusions agree, find errors, and pick the most reliable approach.
4. Final solution: produce the answer based on all approaches, or on the best candidate.
The pattern is especially useful for tasks that demand accuracy, where details are easily missed, or that can be solved in several different ways. Its strengths are greater reliability than single-pass methods, error detection through multiple approaches, and higher confidence in the result.
Example for a volume computation: 1. Initial solution: V = (4/3)πr³. 2. Diversification: solve again with another method, e.g., integration. 3. Comparison: check that the answers agree. • Self-reported
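The diverge-then-converge pattern above is easy to express in code. A minimal sketch, assuming a hypothetical `ask_model` callable that returns one final answer per call (majority vote stands in for the fuller comparison step):

```python
from collections import Counter

def diamond_solve(question: str, ask_model, n_samples: int = 5) -> str:
    """Diverge-then-converge: sample several independent solutions,
    then reconcile them by majority vote on the final answer."""
    candidates = [ask_model(question) for _ in range(n_samples)]
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer  # the answer most solutions agree on

# Usage (ask_model is a stand-in for any LLM call returning a final answer):
# best = diamond_solve("Volume of a sphere with r = 3?", ask_model)
```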
Other Tests
Specialized benchmarks
Aider-Polyglot
accuracy • Self-reported
AIME 2024
pass@1 (first-attempt pass rate) measures the proportion of tasks the model solves correctly on its first attempt. It is a strict metric, since it evaluates the probability of obtaining a correct answer in a single try. To compute pass@1: 1. The model generates one solution per task. 2. Each solution is judged correct or incorrect. 3. pass@1 = (number of correctly solved tasks) / (total number of tasks). The metric matters because it reflects the model's ability to give correct answers without multiple attempts or iterations, which makes it especially relevant for real-world usage. • Self-reported
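In code, the pass@1 computation described above reduces to a single ratio. A minimal sketch (the `is_correct` checker is a hypothetical stand-in for the benchmark's grader):

```python
def pass_at_1(solutions, references, is_correct) -> float:
    """pass@1 = (# tasks solved correctly on the single attempt) / (# tasks)."""
    solved = sum(is_correct(s, ref) for s, ref in zip(solutions, references))
    return solved / len(references)

# Example: 3 of 4 first attempts match the reference answers -> 0.75
sols = ["42", "7", "19", "100"]
refs = ["42", "7", "19", "99"]
print(pass_at_1(sols, refs, lambda s, r: s == r))  # 0.75
```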
AIME 2025
pass@1 (proportion of tasks solved on the first attempt; see the definition under AIME 2024 above) • Self-reported
Humanity's Last Exam
Echo detection: flags cases where the model's answer merely restates the query text instead of engaging with it. This is a problem because users want the model to analyze their text and reason about it, not simply repeat parts of it back. To identify echoing, the model's answer is compared against the query text: if the answer contains essentially the same content as the query, it is treated as an echo; quoted phrases and cases where the query legitimately appears as part of a longer answer are the difficult edge cases. • Self-reported
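One simple way to implement the echo check described above is lexical overlap between answer and query. A minimal sketch (the 0.8 threshold and word-overlap heuristic are illustrative assumptions, not a documented scoring rule):

```python
import re

def is_echo(query: str, answer: str, threshold: float = 0.8) -> bool:
    """Flag an answer that mostly restates the query instead of reasoning:
    measure what fraction of the answer's words already appear in the query."""
    tokenize = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    q_words = set(tokenize(query))
    a_words = tokenize(answer)
    if not a_words:
        return False
    overlap = sum(w in q_words for w in a_words) / len(a_words)
    return overlap >= threshold

print(is_echo("Why is the sky blue?", "The sky is blue because why"))  # True
```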
LiveCodeBench
Method description: a probe of how the LLM handles a false statement embedded in a query, used to evaluate how well the model detects and corrects the false premise instead of accepting it. The method: 1. Construct a query containing a false statement (for example, a counterfactual historical claim). 2. Send the query to the model phrased as though the statement were true. 3. Analyze the model's ability to think critically: does it notice that the premise contradicts the actual facts? This evaluates the model's ability to work with false premises, how well it can explain its reasoning, and the gap between what the query asserts and what is actually accurate. A good answer first notes that the premise is false, and only then (optionally) speculates about the counterfactual. • Self-reported
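A minimal sketch of the false-premise probe described above, assuming a hypothetical `ask_model` callable and using a crude keyword check as a proxy for a real judge model:

```python
def probe_false_premise(ask_model, loaded_question: str) -> bool:
    """Return True if the model pushes back on the false premise
    instead of answering as though it were true (crude keyword check)."""
    answer = ask_model(loaded_question).lower()
    pushback_markers = ("actually", "in fact", "is not true", "incorrect",
                        "there is no", "never happened")
    return any(marker in answer for marker in pushback_markers)

# Usage: a question that presupposes a false fact.
# Einstein won the Nobel Prize for the photoelectric effect, not relativity,
# so a critical model should object rather than explain the false premise.
# flagged = probe_false_premise(ask_model,
#     "Why did Einstein win the Nobel Prize for relativity?")
```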
License & Metadata
License
Apache 2.0
Announcement Date
June 10, 2025
Last Updated
July 19, 2025
Similar Models
Mistral Small 3.1 24B Instruct
Mistral AI
Multimodal · 24.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Mistral Small 3.2 24B Instruct
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (HumanEval)
Released: Jun 2025
Mistral Small 3 24B Base
Mistral AI
Multimodal · 23.6B
Best score: 0.9 (ARC)
Released: Jan 2025
Pixtral-12B
Mistral AI
Multimodal · 12.4B
Best score: 0.7 (HumanEval)
Released: Sep 2024
Price: $0.15/1M tokens
Mistral Small 3.1 24B Base
Mistral AI
Multimodal · 24.0B
Best score: 0.8 (MMLU)
Released: Mar 2025
Price: $0.10/1M tokens
GPT OSS 20B
OpenAI
Multimodal · 20.0B
Best score: 0.9 (MMLU)
Released: Aug 2025
Price: $0.10/1M tokens
Qwen2.5 VL 32B Instruct
Alibaba
Multimodal · 33.5B
Best score: 0.9 (HumanEval)
Released: Feb 2025
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.