Key Specifications
Parameters
8.0B
Context
128.0K
Release Date
October 16, 2024
Average Score
63.3%
Timeline
Key dates in the model's history
Announcement
October 16, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
8.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
## Layer-Based Attribution: from general concepts to specific components

Most attribution methods show which parts of a model contribute to its behavior in general, either by tracing influence back to training data or by intervening on components and measuring the effect on the final output. However, these methods can be coarse, since they characterize the model as a whole. We instead propose a more targeted measurement for specific tasks or concepts. The general recipe is:

1. Define the ability or task under analysis, for example, determining which components support mathematical abilities or the ability to follow instructions.
2. Assemble a set of test examples that exercise this ability. These examples can be hand-written or generated, and can be combined to create concept-specific data.
3. Choose a metric for measuring performance on the concept. This can be accuracy, agreement between outputs, or something richer, for example, a score assigned by an LLM judge.

For each component (for example, a layer), apply the method below to determine its contribution to the concept.

### Method

We propose an approach in which we learn which components can be ablated to remove specific capabilities while maintaining the model's general performance:

1. For each layer L_i, attach a trainable adapter A_i immediately after it.
2. Freeze all base-model weights.
3. Train A_i to optimize the concept metric on a set of training examples (held out from the test examples). For example, train A_i so that the model gives incorrect answers on mathematical tasks while preserving its other answers.
4. It is also important to check which other capabilities are affected. • Self-reported
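As an illustration of the ablation step, here is a minimal PyTorch sketch under several assumptions: a GPT-2 backbone standing in for the model under study, a bottleneck-adapter shape for A_i, and a negated language-modeling loss standing in for the concept metric. None of these choices come from the note above.

```python
# Sketch of adapter-based capability ablation: freeze the base model, attach a
# small adapter A_i after each layer L_i, and train only the adapters so that
# performance drops on the concept set while general behavior is preserved.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class AdapterWrapped(nn.Module):
    """Wraps a transformer block and adds a trainable residual adapter after it."""
    def __init__(self, block, hidden_size, bottleneck=64):
        super().__init__()
        self.block = block
        self.adapter = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_size),
        )
        # Zero-init the output projection so the adapter starts as an identity.
        nn.init.zeros_(self.adapter[-1].weight)
        nn.init.zeros_(self.adapter[-1].bias)

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.block(hidden_states, *args, **kwargs)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs
        hidden = hidden + self.adapter(hidden)  # residual adapter A_i
        if isinstance(outputs, tuple):
            return (hidden,) + outputs[1:]
        return hidden

model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # step 2: freeze the base model

hidden_size = model.config.hidden_size
for i, block in enumerate(model.transformer.h):
    model.transformer.h[i] = AdapterWrapped(block, hidden_size)

adapter_params = [p for n, p in model.named_parameters() if ".adapter." in n]
opt = torch.optim.Adam(adapter_params, lr=1e-4)

def step(math_batch, general_batch, alpha=1.0):
    """Step 3 (sketch): raise loss on math examples, keep it low elsewhere.

    Both batches are assumed to be tokenized dicts with input_ids and
    attention_mask; the negated-loss objective is illustrative only.
    """
    math_loss = model(**math_batch, labels=math_batch["input_ids"]).loss
    gen_loss = model(**general_batch, labels=general_batch["input_ids"]).loss
    loss = -math_loss + alpha * gen_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return math_loss.item(), gen_loss.item()
```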
Winogrande
For the GPQA benchmark we designed and used a new evaluation method to help with the verification of complex questions with open-ended answers. Existing methods, such as verification against reference answers or exact-match comparison, do not evaluate the accuracy of LLM reasoning in complex subject fields where answers are hard to check, and human evaluation assumes expertise that people do not have in all fields. We developed Process-Based Evaluation, a method which allows evaluating LLM answers without a pre-built set of "gold" answers. The method is based on two key components:

1. A rubric that describes, step by step, the process an evaluator should follow to determine the correctness of an answer.
2. Use of a strong LLM as the evaluator, which applies this rubric to each answer.

First we worked with experts in each domain field to develop rubrics describing the process of determining the correctness of answers. Then we had GPT-4 perform these evaluations for each answer. To verify the reliability of this method, we:

- obtained evaluations from GPT-4 alongside evaluations from human experts on the same questions
- found that GPT-4's evaluations agree well with the expert evaluations (r = 0.83)
- will publish full details of the evaluation method in a report released together with the dataset • Self-reported
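A minimal sketch of the judge step, assuming the OpenAI Python client; the rubric text, verdict format, and the `judge` helper are illustrative stand-ins, not the actual expert-written rubric.

```python
# Process-based evaluation sketch: a strong LLM applies a step-by-step rubric
# to each answer instead of comparing against a gold reference answer.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an answer to a graduate-level science question.
Follow these steps and report each one:
1. Restate the key claim of the answer.
2. Check each reasoning step for factual or logical errors.
3. Check whether the final conclusion follows from the steps.
Finish with a single line: VERDICT: correct | incorrect."""

def judge(question: str, answer: str, model: str = "gpt-4") -> bool:
    """Apply the rubric to one answer and return True if judged correct."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer to grade:\n{answer}"},
        ],
        temperature=0,
    )
    return "verdict: correct" in resp.choices[0].message.content.lower()
```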
Programming
Programming skills tests
HumanEval
Auto-Tool Use via External Queries: in this approach the model can issue queries to external APIs to obtain information or computation. These queries can be directed at tools such as a Python interpreter or other knowledge sources. The model issues these queries autonomously, meaning that it decides when it is necessary to use a tool, forms the query, and incorporates the result. This method leans on the model's own reasoning, but usually with a more constrained set of tools and with explicit strategies for execution. The approach benefits from the model being able to offload computation or search for information when necessary and then include the results in its answer. Limitations include malformed queries to the tools and errors in interpreting their results: if the model makes an error in one of these steps, the final result can be wrong. • Self-reported
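The loop below is a toy sketch of the described query-and-result cycle. The `TOOL[python]: ...` / `RESULT:` protocol, the `generate` callback, and the expression-only Python tool are all assumptions for illustration, not the benchmark's actual format.

```python
# Auto-tool-use sketch: let the model interleave tool queries with text until
# it produces an answer without requesting a tool.
import re

def run_python(code: str) -> str:
    """Toy 'Python tool': evaluate a single expression and return its repr."""
    try:
        return repr(eval(code, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"error: {e}"

TOOL_RE = re.compile(r"TOOL\[python\]:\s*(.+)")

def answer_with_tools(prompt: str, generate, max_rounds: int = 4) -> str:
    """Run the query/result loop; `generate` wraps the model being tested."""
    transcript = prompt
    for _ in range(max_rounds):
        out = generate(transcript)          # model decides whether it needs a tool
        match = TOOL_RE.search(out)
        if match is None:
            return out                      # no tool request: this is the answer
        result = run_python(match.group(1)) # execute the query the model produced
        transcript += out + f"\nRESULT: {result}\n"
    return out
```

Note how both failure modes from the note show up directly here: a malformed query fails inside `run_python`, and a misread `RESULT:` line corrupts the final answer.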
Mathematics
Mathematical problems and computations
MATH
# Evaluating LLM understanding in mathematical reasoning

## Overview

In this work we present an evaluation of how well LLMs understand mathematical reasoning. It does not require internal access to the LLM and uses only the API, in the form of questions and answers.

## The evaluation consists of three stages

**Stage 1: A set of mathematical tasks.** We choose a set of tasks from mathematical competitions such as AIME, the American Mathematics Competition, and the IMO, and format them for presentation to the model.

**Stage 2: Query solutions with different prompts.** We ask the model to solve each task under different prompting strategies:

1. **"Solve directly"**: ask the model to solve the task.
2. **"Step-by-step solution"**: ask the model to solve the task step by step.
3. **"Hinted prompt"**: give the model a hint and ask for a solution.
4. **"Self-check"**: give the model its answer and ask it to verify it.

**Stage 3: Error analysis and evaluation of understanding.** We analyze the model's answers, classifying errors as follows:

- **Conceptual errors**: the model does not understand the main mathematical concepts.
- **Procedural errors**: the model understands the concepts but makes errors in execution.
- **Computational errors**: the model understands the method but makes arithmetic mistakes.
- **Verification failures**: the model cannot correctly verify a solution, even when it produced the answer.

## Evaluation metrics

For each model we report:

1. **Solve rate**: fraction of correctly solved tasks.
2. **Understanding indicator**: fraction of tasks without conceptual errors.
3. **Procedural accuracy**: fraction of tasks without procedural errors among those with no conceptual errors.
4. **Computational accuracy**: fraction of tasks without computational errors among those with no conceptual or procedural errors.
5. **Verification ability indicator**. • Self-reported
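The hierarchy of indicators, each conditioned on the absence of the earlier error types, can be made concrete with a short sketch; the `GradedTask` record and field names are hypothetical.

```python
# Staged accuracy metrics: each later metric is computed only over tasks that
# passed the earlier error checks, mirroring the conditional definitions above.
from dataclasses import dataclass

@dataclass
class GradedTask:
    solved: bool
    conceptual_error: bool
    procedural_error: bool
    computational_error: bool

def metrics(tasks: list[GradedTask]) -> dict[str, float]:
    def rate(items, pred):
        return sum(pred(t) for t in items) / len(items) if items else float("nan")

    no_concept = [t for t in tasks if not t.conceptual_error]
    no_proc = [t for t in no_concept if not t.procedural_error]
    return {
        "solve_rate": rate(tasks, lambda t: t.solved),
        "understanding": rate(tasks, lambda t: not t.conceptual_error),
        # conditioned on tasks with no conceptual errors
        "procedural_accuracy": rate(no_concept, lambda t: not t.procedural_error),
        # conditioned on tasks with no conceptual or procedural errors
        "computational_accuracy": rate(no_proc, lambda t: not t.computational_error),
    }
```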
Other Tests
Specialized benchmarks
AGIEval
We introduce a new method for probing the performance of language models, which we call probability truncation. Unlike prior methods, it works with an already trained model, examining its outputs to identify errors. The idea is that model errors often coincide with low-confidence tokens at specific points in the reasoning; in effect it gives the model the ability to ask "was I confident in this step?", which lets us locate and even correct errors in reasoning. The method includes three steps:

1) Standard generation of output from the model at a fixed temperature and format
2) Computation of the probability of each token in the generated text
3) Application of a probability threshold, with which we can determine where specifically the model makes errors or is uncertain

This yields a signal about the model's confidence in the reasoning behind its answers. Example: the model solves a task and writes "2 + 3 = 6". Token-level analysis can show that the probability of "6" was low (0.3), which indicates that the model was not very confident in this step, pointing to a likely error in the answer. • Self-reported
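A minimal sketch of steps 2 and 3, assuming a HuggingFace causal LM; GPT-2 and the 0.3 threshold are placeholders taken from the example above, not the method's actual settings.

```python
# Probability-truncation sketch: score each token of a generated answer under
# the model and flag tokens whose probability falls below a threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def flag_uncertain_tokens(text: str, threshold: float = 0.3):
    """Return (token, probability) pairs below the probability threshold."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits          # step 2: per-position distributions
    probs = logits.softmax(-1)
    flagged = []
    for pos in range(1, ids.shape[1]):      # token at pos is predicted from pos-1
        p = probs[0, pos - 1, ids[0, pos]].item()
        if p < threshold:                   # step 3: probability truncation
            flagged.append((tok.decode(ids[0, pos]), round(p, 3)))
    return flagged

print(flag_uncertain_tokens("The sum of 2 and 3 is 6."))
```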
ARC-C
# Self-interpretation of language models

In this note we consider methods for interpreting language models through the models themselves. These methods aim at understanding the capabilities, limitations, and behavior of models, and differ in the degree to which they rely on the model's own description of its work or thinking processes.

## Learning from model generations

One question is how a model solves specific tasks. For example, if an LLM demonstrates high performance on mathematical tasks, we can ask "how" it solves them: does it use internal computation, memorized training data, or learned heuristics?

### Self-explanation

One of the most basic methods of interpretation is to ask the model to explain its process of reasoning. This method, also known as "self-explanation" or "introspective prompting", involves querying the model:

1. about its general capabilities or limitations
2. for the specific steps of reasoning it took when solving a task
3. to reproduce its own computations and explain its errors

Although this provides readable descriptions, they can be unreliable or confabulated: models can produce explanations that sound plausible but do not reflect their internal computation.

### Interpretation through training

A more structured approach consists in training a second model to interpret the first. For example, one can record the behavior of one model and then use another model to generate explanations of that behavior. This method can help identify patterns in the first model's computations that are not visible from direct inspection; nevertheless, the accuracy and faithfulness of such approaches remain open research questions.

## Behavioral analysis • Self-reported
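A sketch of the three self-explanation queries as prompts; the wording and the `generate` callback are hypothetical, and as the note warns, the returned explanations may be confabulated and should be checked against the model's actual behavior.

```python
# Self-explanation probe sketch: collect the three kinds of introspective
# queries described above from any chat model wrapped by generate().
SELF_EXPLANATION_PROMPTS = [
    # 1. general capabilities / limitations
    "What kinds of math problems are you likely to get wrong, and why?",
    # 2. specific reasoning steps for a task
    "Solve the task below, then list the reasoning steps you actually used.\n\nTask: {task}",
    # 3. reproduce a computation and explain an error
    "You previously answered '{answer}' to the task below, which is wrong. "
    "Redo the computation and explain where the error occurred.\n\nTask: {task}",
]

def probe(generate, task: str, wrong_answer: str) -> list[str]:
    """Collect all three self-explanations for one task."""
    return [
        generate(SELF_EXPLANATION_PROMPTS[0]),
        generate(SELF_EXPLANATION_PROMPTS[1].format(task=task)),
        generate(SELF_EXPLANATION_PROMPTS[2].format(task=task, answer=wrong_answer)),
    ]
```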
Arena Hard
• Self-reported
French MMLU
# Reasoning speed

If we compare LLMs only on the basis of accuracy in reasoning tests, this does not account for their efficiency. For example, one model may need very heavy computation to achieve a given level of accuracy, while another reaches the same accuracy far more cheaply. Here we propose an evaluation of LLM reasoning speed in terms of efficiency. We define the **reasoning speed** of an LLM as its ability to perform reasoning tasks at a given accuracy using a given computational budget.

## Method for measuring reasoning speed

We propose an approach for measuring the reasoning speed of LLMs across various compute levels:

1. **Benchmark definition**: a benchmark which fixes: tasks requiring reasoning; evaluation metrics (for example, correctness); and the sample size for testing.
2. **Models for evaluation**: several LLMs of different sizes (from small to large) and architectures.
3. **Standardized evaluation**: a common protocol: a standard query format; fixed generation settings (temperature, top-p); and a fixed number of examples for few-shot testing.
4. **Results**: for each model, record accuracy on the test set and computational cost (number of parameters, number of context tokens, FLOPs per query).
5. **Performance curve**: the trade-off between accuracy and compute.
6. **Efficiency analysis**: metrics for reasoning speed, where a higher value means faster reasoning, and the points where extra computation yields diminishing returns.

## Metrics for reasoning speed

For measuring reasoning speed we define metrics over this accuracy-compute curve. • Self-reported
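One way to turn the accuracy/compute trade-off into a single number is sketched below; the `reasoning_speed` metric (accuracy per thousand generated tokens) and all numbers are illustrative, not the note's actual definition.

```python
# Reasoning-speed sketch: compare models by accuracy per unit of compute,
# using generated tokens per query as a cheap proxy for computational cost.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    accuracy: float          # fraction of reasoning tasks solved
    tokens_per_query: float  # proxy for compute cost

def reasoning_speed(r: EvalResult) -> float:
    """Toy efficiency metric: accuracy per thousand generated tokens."""
    return r.accuracy / (r.tokens_per_query / 1000)

results = [
    EvalResult("small-model", accuracy=0.61, tokens_per_query=350),
    EvalResult("large-model", accuracy=0.78, tokens_per_query=2100),
]
for r in sorted(results, key=reasoning_speed, reverse=True):
    print(f"{r.model}: accuracy={r.accuracy:.2f}, "
          f"speed={reasoning_speed(r):.2f} acc/kTok")
```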
MBPP pass@1
# Method for Behavioral Analysis
## Introduction
This document outlines the methodology for a systematic behavioral analysis of LLM performance on mathematical reasoning tasks. Our approach combines detailed error analysis with examination of reasoning patterns to understand the foundational capabilities and limitations of these models.
## Analysis Approach
### 1. Error Classification
We categorize errors into a multi-level taxonomy:
- **Conceptual errors**: Fundamental misunderstandings of mathematical concepts
- **Procedural errors**: Mistakes in executing calculation steps
- **Reasoning errors**: Logical fallacies or invalid deductive steps
- **Attention errors**: Failures to track or maintain relevant information
### 2. Reasoning Pattern Analysis
We examine:
- **Solution structure**: The overall approach to problem decomposition
- **Verification behavior**: How models check their work and handle uncertainty
- **Tool usage patterns**: When and how models leverage external calculation tools
### 3. Comparative Analysis
We contrast performance across:
- Different model architectures and sizes
- Various prompt formats and system instructions
- Problems of increasing complexity within the same domain
## Implementation Details
The analysis uses a combination of:
- Manual evaluation by mathematics experts
- Automated pattern detection using custom parsing algorithms
- Standardized evaluation metrics across different problem types
- Cross-validation between different evaluators to ensure reliability
This methodology provides both quantitative metrics and qualitative insights into model behavior, revealing not just what models get wrong, but why they fail in specific ways. • Self-reported
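A minimal sketch of the error taxonomy as data, assuming transcripts labeled by expert graders; the `Transcript` record and label strings are hypothetical.

```python
# Behavioral-analysis sketch: represent the multi-level error taxonomy as data
# and aggregate how often each category appears across graded transcripts.
from collections import Counter
from dataclasses import dataclass, field

ERROR_TAXONOMY = {
    "conceptual": "Fundamental misunderstanding of a mathematical concept",
    "procedural": "Mistake in executing a calculation step",
    "reasoning": "Logical fallacy or invalid deductive step",
    "attention": "Failure to track or maintain relevant information",
}

@dataclass
class Transcript:
    problem_id: str
    errors: list[str] = field(default_factory=list)  # keys from ERROR_TAXONOMY

def error_profile(transcripts: list[Transcript]) -> Counter:
    """Count how often each error category appears across graded transcripts."""
    counts = Counter()
    for t in transcripts:
        for e in t.errors:
            assert e in ERROR_TAXONOMY, f"unknown error label: {e}"
            counts[e] += 1
    return counts

profile = error_profile([
    Transcript("p1", ["procedural"]),
    Transcript("p2", ["conceptual", "reasoning"]),
])
print(profile.most_common())
```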
MT-Bench
• Self-reported
TriviaQA
# "Verification with using " ## When we about answers for general questions with context, consists in that, in order to determine, is whether this answer model fully on basis context. this how that, whether answer fully "inside " context, partially in or fully its. — for such thinking. ## process 1. First set /facts, in answer model. 2. For each whether it: - **Fully in context**: All details explicitly context. - **Partially in context**: Some aspects but is additional or nuances, in context. - **context**: not 3. Especially on: - numbers, statements - how facts - for context 4. answer model and context how two in : - In answer should be fully context - part answer, for context, is 5. general answer: - **Fully **: All inside "context" - **Partially **: Some for context - **/not **: or all context ## Advantages - model for analysis evaluate parts answer by to Especially for identification cases, when model from context with its ## Limitations - Can be determine in information • Self-reported
License & Metadata
License
mistral_research_license
Announcement Date
October 16, 2024
Last Updated
July 19, 2025
Similar Models
Gemma 2 9B
Google
9.2B
Best score: 0.7 (MMLU)
Released: Jun 2024
Llama 3.2 3B Instruct
Meta
3.2B
Best score: 0.8 (ARC)
Released: Sep 2024
Price: $0.01/1M tokens
Llama 3.1 Nemotron Nano 8B V1
NVIDIA
8.0B
Best score: 0.5 (GPQA)
Released: Mar 2025
Phi 4 Mini
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Feb 2025
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Aug 2024
Price: $0.10/1M tokens
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Price: $0.30/1M tokens
Phi 4 Mini Reasoning
Microsoft
3.8B
Best score: 0.5 (GPQA)
Released: Apr 2025
Qwen2 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Jul 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.