Key Specifications
Parameters
8.0B
Context
128.0K
Release Date
October 16, 2024
Average Score
63.3%
Timeline
Key dates in the model's history
Announcement
October 16, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
8.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
## Layer-Based Attribution: from general concepts to specific components

Most attribution methods show which parts of a model contribute to its behavior in general, either by tracing influence back to training data or by intervening on components and measuring the effect on the final output. However, these methods can be coarse, since they characterize the model as a whole. We instead propose a more targeted measurement for specific tasks or concepts. The general recipe is:

1. Define the ability or task under analysis, for example, determining which components support mathematical abilities or the ability to follow instructions.
2. Assemble a set of test examples that exercise this ability. These examples can be hand-written or generated, and can be combined to create concept-specific data.
3. Choose a metric for measuring performance on the concept. This can be accuracy, agreement between outputs, or something richer, for example, a score assigned by an LLM judge.

For each component (for example, a layer), apply the method below to determine its contribution to the concept.

### Method

We propose an approach in which we learn which components can be ablated to remove specific capabilities while maintaining the model's general performance:

1. For each layer L_i, attach a trainable adapter A_i immediately after it.
2. Freeze all base-model weights.
3. Train A_i to optimize the concept metric on a set of training examples (held out from the test examples). For example, train A_i so that the model gives incorrect answers on mathematical tasks while preserving its other answers.
4. It is also important to check which other capabilities are affected. • Self-reported
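As an illustration of the ablation step, here is a minimal PyTorch sketch under several assumptions: a GPT-2 backbone standing in for the model under study, a bottleneck-adapter shape for A_i, and a negated language-modeling loss standing in for the concept metric. None of these choices come from the note above.

```python
# Sketch of adapter-based capability ablation: freeze the base model, attach a
# small adapter A_i after each layer L_i, and train only the adapters so that
# performance drops on the concept set while general behavior is preserved.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class AdapterWrapped(nn.Module):
    """Wraps a transformer block and adds a trainable residual adapter after it."""
    def __init__(self, block, hidden_size, bottleneck=64):
        super().__init__()
        self.block = block
        self.adapter = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_size),
        )
        # Zero-init the output projection so the adapter starts as an identity.
        nn.init.zeros_(self.adapter[-1].weight)
        nn.init.zeros_(self.adapter[-1].bias)

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.block(hidden_states, *args, **kwargs)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs
        hidden = hidden + self.adapter(hidden)  # residual adapter A_i
        if isinstance(outputs, tuple):
            return (hidden,) + outputs[1:]
        return hidden

model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # step 2: freeze the base model

hidden_size = model.config.hidden_size
for i, block in enumerate(model.transformer.h):
    model.transformer.h[i] = AdapterWrapped(block, hidden_size)

adapter_params = [p for n, p in model.named_parameters() if ".adapter." in n]
opt = torch.optim.Adam(adapter_params, lr=1e-4)

def step(math_batch, general_batch, alpha=1.0):
    """Step 3 (sketch): raise loss on math examples, keep it low elsewhere.

    Both batches are assumed to be tokenized dicts with input_ids and
    attention_mask; the negated-loss objective is illustrative only.
    """
    math_loss = model(**math_batch, labels=math_batch["input_ids"]).loss
    gen_loss = model(**general_batch, labels=general_batch["input_ids"]).loss
    loss = -math_loss + alpha * gen_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return math_loss.item(), gen_loss.item()
```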
Winogrande
For the GPQA benchmark we designed and used a new evaluation method to help with the verification of complex questions with open-ended answers. Existing methods, such as verification against reference answers or exact-match comparison, do not evaluate the accuracy of LLM reasoning in complex subject fields where answers are hard to check, and human evaluation assumes expertise that people do not have in all fields. We developed Process-Based Evaluation, a method which allows evaluating LLM answers without a pre-built set of "gold" answers. The method is based on two key components:

1. A rubric that describes, step by step, the process an evaluator should follow to determine the correctness of an answer.
2. Use of a strong LLM as the evaluator, which applies this rubric to each answer.

First we worked with experts in each domain field to develop rubrics describing the process of determining the correctness of answers. Then we had GPT-4 perform these evaluations for each answer. To verify the reliability of this method, we:

- obtained evaluations from GPT-4 alongside evaluations from human experts on the same questions
- found that GPT-4's evaluations agree well with the expert evaluations (r = 0.83)
- will publish full details of the evaluation method in a report released together with the dataset • Self-reported
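A minimal sketch of the judge step, assuming the OpenAI Python client; the rubric text, verdict format, and the `judge` helper are illustrative stand-ins, not the actual expert-written rubric.

```python
# Process-based evaluation sketch: a strong LLM applies a step-by-step rubric
# to each answer instead of comparing against a gold reference answer.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an answer to a graduate-level science question.
Follow these steps and report each one:
1. Restate the key claim of the answer.
2. Check each reasoning step for factual or logical errors.
3. Check whether the final conclusion follows from the steps.
Finish with a single line: VERDICT: correct | incorrect."""

def judge(question: str, answer: str, model: str = "gpt-4") -> bool:
    """Apply the rubric to one answer and return True if judged correct."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer to grade:\n{answer}"},
        ],
        temperature=0,
    )
    return "verdict: correct" in resp.choices[0].message.content.lower()
```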
Programming
Programming skills tests
HumanEval
Auto-Tool Use via External Queries: in this approach the model can issue queries to external APIs to obtain information or computation. These queries can be directed at tools such as a Python interpreter or other knowledge sources. The model issues these queries autonomously, meaning that it decides when it is necessary to use a tool, forms the query, and incorporates the result. This method leans on the model's own reasoning, but usually with a more constrained set of tools and with explicit strategies for execution. The approach benefits from the model being able to offload computation or search for information when necessary and then include the results in its answer. Limitations include malformed queries to the tools and errors in interpreting their results: if the model makes an error in one of these steps, the final result can be wrong. • Self-reported
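The loop below is a toy sketch of the described query-and-result cycle. The `TOOL[python]: ...` / `RESULT:` protocol, the `generate` callback, and the expression-only Python tool are all assumptions for illustration, not the benchmark's actual format.

```python
# Auto-tool-use sketch: let the model interleave tool queries with text until
# it produces an answer without requesting a tool.
import re

def run_python(code: str) -> str:
    """Toy 'Python tool': evaluate a single expression and return its repr."""
    try:
        return repr(eval(code, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"error: {e}"

TOOL_RE = re.compile(r"TOOL\[python\]:\s*(.+)")

def answer_with_tools(prompt: str, generate, max_rounds: int = 4) -> str:
    """Run the query/result loop; `generate` wraps the model being tested."""
    transcript = prompt
    for _ in range(max_rounds):
        out = generate(transcript)          # model decides whether it needs a tool
        match = TOOL_RE.search(out)
        if match is None:
            return out                      # no tool request: this is the answer
        result = run_python(match.group(1)) # execute the query the model produced
        transcript += out + f"\nRESULT: {result}\n"
    return out
```

Note how both failure modes from the note show up directly here: a malformed query fails inside `run_python`, and a misread `RESULT:` line corrupts the final answer.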
Mathematics
Mathematical problems and computations
MATH
# Evaluating LLM understanding in mathematical reasoning

## Overview

In this work we present an evaluation of how well LLMs understand mathematical reasoning. It does not require internal access to the LLM and uses only the API, in the form of questions and answers.

## The evaluation consists of three stages

**Stage 1: A set of mathematical tasks.** We choose a set of tasks from mathematical competitions such as AIME, the American Mathematics Competition, and the IMO, and format them for presentation to the model.

**Stage 2: Query solutions with different prompts.** We ask the model to solve each task under different prompting strategies:

1. **"Solve directly"**: ask the model to solve the task.
2. **"Step-by-step solution"**: ask the model to solve the task step by step.
3. **"Hinted prompt"**: give the model a hint and ask for a solution.
4. **"Self-check"**: give the model its answer and ask it to verify it.

**Stage 3: Error analysis and evaluation of understanding.** We analyze the model's answers, classifying errors as follows:

- **Conceptual errors**: the model does not understand the main mathematical concepts.
- **Procedural errors**: the model understands the concepts but makes errors in execution.
- **Computational errors**: the model understands the method but makes arithmetic mistakes.
- **Verification failures**: the model cannot correctly verify a solution, even when it produced the answer.

## Evaluation metrics

For each model we report:

1. **Solve rate**: fraction of correctly solved tasks.
2. **Understanding indicator**: fraction of tasks without conceptual errors.
3. **Procedural accuracy**: fraction of tasks without procedural errors among those with no conceptual errors.
4. **Computational accuracy**: fraction of tasks without computational errors among those with no conceptual or procedural errors.
5. **Verification ability indicator**. • Self-reported
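The hierarchy of indicators, each conditioned on the absence of the earlier error types, can be made concrete with a short sketch; the `GradedTask` record and field names are hypothetical.

```python
# Staged accuracy metrics: each later metric is computed only over tasks that
# passed the earlier error checks, mirroring the conditional definitions above.
from dataclasses import dataclass

@dataclass
class GradedTask:
    solved: bool
    conceptual_error: bool
    procedural_error: bool
    computational_error: bool

def metrics(tasks: list[GradedTask]) -> dict[str, float]:
    def rate(items, pred):
        return sum(pred(t) for t in items) / len(items) if items else float("nan")

    no_concept = [t for t in tasks if not t.conceptual_error]
    no_proc = [t for t in no_concept if not t.procedural_error]
    return {
        "solve_rate": rate(tasks, lambda t: t.solved),
        "understanding": rate(tasks, lambda t: not t.conceptual_error),
        # conditioned on tasks with no conceptual errors
        "procedural_accuracy": rate(no_concept, lambda t: not t.procedural_error),
        # conditioned on tasks with no conceptual or procedural errors
        "computational_accuracy": rate(no_proc, lambda t: not t.computational_error),
    }
```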
Other Tests
Specialized benchmarks
AGIEval
We introduce a new method for probing the performance of language models, which we call probability truncation. Unlike prior methods, it works with an already trained model, examining its outputs to identify errors. The idea is that model errors often coincide with low-confidence tokens at specific points in the reasoning; in effect it gives the model the ability to ask "was I confident in this step?", which lets us locate and even correct errors in reasoning. The method includes three steps:

1) Standard generation of output from the model at a fixed temperature and format
2) Computation of the probability of each token in the generated text
3) Application of a probability threshold, with which we can determine where specifically the model makes errors or is uncertain

This yields a signal about the model's confidence in the reasoning behind its answers. Example: the model solves a task and writes "2 + 3 = 6". Token-level analysis can show that the probability of "6" was low (0.3), which indicates that the model was not very confident in this step, pointing to a likely error in the answer. • Self-reported
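A minimal sketch of steps 2 and 3, assuming a HuggingFace causal LM; GPT-2 and the 0.3 threshold are placeholders taken from the example above, not the method's actual settings.

```python
# Probability-truncation sketch: score each token of a generated answer under
# the model and flag tokens whose probability falls below a threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def flag_uncertain_tokens(text: str, threshold: float = 0.3):
    """Return (token, probability) pairs below the probability threshold."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits          # step 2: per-position distributions
    probs = logits.softmax(-1)
    flagged = []
    for pos in range(1, ids.shape[1]):      # token at pos is predicted from pos-1
        p = probs[0, pos - 1, ids[0, pos]].item()
        if p < threshold:                   # step 3: probability truncation
            flagged.append((tok.decode(ids[0, pos]), round(p, 3)))
    return flagged

print(flag_uncertain_tokens("The sum of 2 and 3 is 6."))
```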
ARC-C
# Self-interpretation of language models

In this note we consider methods for interpreting language models through the models themselves. These methods aim at understanding the capabilities, limitations, and behavior of models, and differ in the degree to which they rely on the model's own description of its work or thinking processes.

## Learning from model generations

One question is how a model solves specific tasks. For example, if an LLM demonstrates high performance on mathematical tasks, we can ask "how" it solves them: does it use internal computation, memorized training data, or learned heuristics?

### Self-explanation

One of the most basic methods of interpretation is to ask the model to explain its process of reasoning. This method, also known as "self-explanation" or "introspective prompting", involves querying the model:

1. about its general capabilities or limitations
2. for the specific steps of reasoning it took when solving a task
3. to reproduce its own computations and explain its errors

Although this provides readable descriptions, they can be unreliable or confabulated: models can produce explanations that sound plausible but do not reflect their internal computation.

### Interpretation through training

A more structured approach consists in training a second model to interpret the first. For example, one can record the behavior of one model and then use another model to generate explanations of that behavior. This method can help identify patterns in the first model's computations that are not visible from direct inspection; nevertheless, the accuracy and faithfulness of such approaches remain open research questions.

## Behavioral analysis • Self-reported
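A sketch of the three self-explanation queries as prompts; the wording and the `generate` callback are hypothetical, and as the note warns, the returned explanations may be confabulated and should be checked against the model's actual behavior.

```python
# Self-explanation probe sketch: collect the three kinds of introspective
# queries described above from any chat model wrapped by generate().
SELF_EXPLANATION_PROMPTS = [
    # 1. general capabilities / limitations
    "What kinds of math problems are you likely to get wrong, and why?",
    # 2. specific reasoning steps for a task
    "Solve the task below, then list the reasoning steps you actually used.\n\nTask: {task}",
    # 3. reproduce a computation and explain an error
    "You previously answered '{answer}' to the task below, which is wrong. "
    "Redo the computation and explain where the error occurred.\n\nTask: {task}",
]

def probe(generate, task: str, wrong_answer: str) -> list[str]:
    """Collect all three self-explanations for one task."""
    return [
        generate(SELF_EXPLANATION_PROMPTS[0]),
        generate(SELF_EXPLANATION_PROMPTS[1].format(task=task)),
        generate(SELF_EXPLANATION_PROMPTS[2].format(task=task, answer=wrong_answer)),
    ]
```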
Arena Hard
• Self-reported
French MMLU
# Reasoning speed

If we compare LLMs only on the basis of accuracy in reasoning tests, this does not account for their efficiency. For example, one model may need very heavy computation to achieve a given level of accuracy, while another reaches the same accuracy far more cheaply. Here we propose an evaluation of LLM reasoning speed in terms of efficiency. We define the **reasoning speed** of an LLM as its ability to perform reasoning tasks at a given accuracy using a given computational budget.

## Method for measuring reasoning speed

We propose an approach for measuring the reasoning speed of LLMs across various compute levels:

1. **Benchmark definition**: a benchmark which fixes: tasks requiring reasoning; evaluation metrics (for example, correctness); and the sample size for testing.
2. **Models for evaluation**: several LLMs of different sizes (from small to large) and architectures.
3. **Standardized evaluation**: a common protocol: a standard query format; fixed generation settings (temperature, top-p); and a fixed number of examples for few-shot testing.
4. **Results**: for each model, record accuracy on the test set and computational cost (number of parameters, number of context tokens, FLOPs per query).
5. **Performance curve**: the trade-off between accuracy and compute.
6. **Efficiency analysis**: metrics for reasoning speed, where a higher value means faster reasoning, and the points where extra computation yields diminishing returns.

## Metrics for reasoning speed

For measuring reasoning speed we define metrics over this accuracy-compute curve. • Self-reported
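One way to turn the accuracy/compute trade-off into a single number is sketched below; the `reasoning_speed` metric (accuracy per thousand generated tokens) and all numbers are illustrative, not the note's actual definition.

```python
# Reasoning-speed sketch: compare models by accuracy per unit of compute,
# using generated tokens per query as a cheap proxy for computational cost.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    accuracy: float          # fraction of reasoning tasks solved
    tokens_per_query: float  # proxy for compute cost

def reasoning_speed(r: EvalResult) -> float:
    """Toy efficiency metric: accuracy per thousand generated tokens."""
    return r.accuracy / (r.tokens_per_query / 1000)

results = [
    EvalResult("small-model", accuracy=0.61, tokens_per_query=350),
    EvalResult("large-model", accuracy=0.78, tokens_per_query=2100),
]
for r in sorted(results, key=reasoning_speed, reverse=True):
    print(f"{r.model}: accuracy={r.accuracy:.2f}, "
          f"speed={reasoning_speed(r):.2f} acc/kTok")
```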
MBPP pass@1
# Method for Behavioral Analysis
## Introduction
This document outlines the methodology for a systematic behavioral analysis of LLM performance on mathematical reasoning tasks. Our approach combines detailed error analysis with examination of reasoning patterns to understand the foundational capabilities and limitations of these models.
## Analysis Approach
### 1. Error Classification
We categorize errors into a multi-level taxonomy:
- **Conceptual errors**: Fundamental misunderstandings of mathematical concepts
- **Procedural errors**: Mistakes in executing calculation steps
- **Reasoning errors**: Logical fallacies or invalid deductive steps
- **Attention errors**: Failures to track or maintain relevant information
### 2. Reasoning Pattern Analysis
We examine:
- **Solution structure**: The overall approach to problem decomposition
- **Verification behavior**: How models check their work and handle uncertainty
- **Tool usage patterns**: When and how models leverage external calculation tools
### 3. Comparative Analysis
We contrast performance across:
- Different model architectures and sizes
- Various prompt formats and system instructions
- Problems of increasing complexity within the same domain
## Implementation Details
The analysis uses a combination of:
- Manual evaluation by mathematics experts
- Automated pattern detection using custom parsing algorithms
- Standardized evaluation metrics across different problem types
- Cross-validation between different evaluators to ensure reliability
This methodology provides both quantitative metrics and qualitative insights into model behavior, revealing not just what models get wrong, but why they fail in specific ways. • Self-reported
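A minimal sketch of the error taxonomy as data, assuming transcripts labeled by expert graders; the `Transcript` record and label strings are hypothetical.

```python
# Behavioral-analysis sketch: represent the multi-level error taxonomy as data
# and aggregate how often each category appears across graded transcripts.
from collections import Counter
from dataclasses import dataclass, field

ERROR_TAXONOMY = {
    "conceptual": "Fundamental misunderstanding of a mathematical concept",
    "procedural": "Mistake in executing a calculation step",
    "reasoning": "Logical fallacy or invalid deductive step",
    "attention": "Failure to track or maintain relevant information",
}

@dataclass
class Transcript:
    problem_id: str
    errors: list[str] = field(default_factory=list)  # keys from ERROR_TAXONOMY

def error_profile(transcripts: list[Transcript]) -> Counter:
    """Count how often each error category appears across graded transcripts."""
    counts = Counter()
    for t in transcripts:
        for e in t.errors:
            assert e in ERROR_TAXONOMY, f"unknown error label: {e}"
            counts[e] += 1
    return counts

profile = error_profile([
    Transcript("p1", ["procedural"]),
    Transcript("p2", ["conceptual", "reasoning"]),
])
print(profile.most_common())
```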
MT-Bench
• Self-reported
TriviaQA
# "Verification with using " ## When we about answers for general questions with context, consists in that, in order to determine, is whether this answer model fully on basis context. this how that, whether answer fully "inside " context, partially in or fully its. — for such thinking. ## process 1. First set /facts, in answer model. 2. For each whether it: - **Fully in context**: All details explicitly context. - **Partially in context**: Some aspects but is additional or nuances, in context. - **context**: not 3. Especially on: - numbers, statements - how facts - for context 4. answer model and context how two in : - In answer should be fully context - part answer, for context, is 5. general answer: - **Fully **: All inside "context" - **Partially **: Some for context - **/not **: or all context ## Advantages - model for analysis evaluate parts answer by to Especially for identification cases, when model from context with its ## Limitations - Can be determine in information • Self-reported
License & Metadata
License
mistral_research_license
Announcement Date
October 16, 2024
Last Updated
July 19, 2025
Similar Models
Gemma 2 9B
Google
9.2B
Best score: 0.7 (MMLU)
Released: Jun 2024
Llama 3.2 3B Instruct
Meta
3.2B
Best score: 0.8 (ARC)
Released: Sep 2024
Price: $0.01/1M tokens
Llama 3.1 Nemotron Nano 8B V1
NVIDIA
8.0B
Best score: 0.5 (GPQA)
Released: Mar 2025
Phi 4 Mini
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Feb 2025
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Aug 2024
Price: $0.10/1M tokens
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Price: $0.30/1M tokens
Phi 4 Mini Reasoning
Microsoft
3.8B
Best score: 0.5 (GPQA)
Released: Apr 2025
Qwen2 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Jul 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.