
Phi-3.5-mini-instruct

Microsoft

Phi-3.5-mini-instruct is a 3.8-billion-parameter model that supports a context window of up to 128K tokens and features improved multilingual capabilities across more than 20 languages. The model underwent additional training and safety post-training to improve instruction following, reasoning, mathematical computation, and code generation. It is well suited to memory-constrained or latency-sensitive environments and is released under the MIT license.
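For orientation, here is a minimal sketch of running the model locally with the Hugging Face transformers library. It assumes the model ID microsoft/Phi-3.5-mini-instruct as published on the Hub; the dtype and device settings are illustrative rather than prescriptive.

```python
# Minimal sketch: running Phi-3.5-mini-instruct locally with Hugging
# Face transformers. Assumes the "microsoft/Phi-3.5-mini-instruct"
# model ID on the Hub; adjust dtype/device for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 3.8B params fit on a single modern GPU
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize the Pythagorean theorem in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```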

Key Specifications

Parameters
3.8B
Context
128.0K
Release Date
August 23, 2024
Average Score
58.7%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
3.8B
Training Tokens
3.4T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
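At the listed rates, per-request cost is a simple linear function of token counts. A quick illustrative calculation (the token numbers below are made up):

```python
# Back-of-the-envelope cost estimate from the listed prices:
# $0.10 per 1M input tokens and $0.10 per 1M output tokens.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.10

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a near-maximal 120K-token prompt with a 4K-token completion.
print(f"${request_cost(120_000, 4_000):.4f}")  # -> $0.0124
```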

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
5-shot (Self-reported). The model is shown five example tasks with their solutions and then asked to solve a new task in the same format; this measures how well it can pick up a reasoning pattern from a handful of in-context examples, compared with zero-shot. A prompt-construction sketch follows this group.
69.4%
MMLU
5-shot (Self-reported). Standard 5-shot evaluation with worked examples provided in context before the test question.
69.0%
TruthfulQA
10-shot (Self-reported). Ten worked examples, typically with detailed solutions, precede the test question, giving the model a richer picture of the expected format and solution strategy than zero-shot or smaller few-shot settings.
64.0%
Winogrande
5-shot (Self-reported)
68.5%
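The k-shot settings cited above are easiest to see concretely. Below is a minimal, illustrative sketch of how a 5-shot prompt is typically assembled; the helper function and the question/answer pairs are invented for demonstration and are not drawn from any benchmark.

```python
# Sketch of k-shot prompt construction (k = 5), as used in benchmarks
# like HellaSwag and MMLU above: k solved examples are prepended to the
# test question so the model can infer the expected answer format.
# All examples below are illustrative, not benchmark items.
def build_k_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

shots = [
    ("What is the capital of France?", "Paris"),
    ("What is 7 * 8?", "56"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Hamlet'?", "William Shakespeare"),
    ("What gas do plants absorb during photosynthesis?", "Carbon dioxide"),
]
print(build_k_shot_prompt(shots, "What is the boiling point of water in Celsius?"))
```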

Programming

Programming skills tests
HumanEval
0-shot (Self-reported). The model receives only the task instructions, with no worked examples of how to complete it. A sketch of the functional-correctness scoring used by this benchmark follows this group.
62.8%
MBPP
3-shot (Self-reported)
69.6%
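HumanEval-style scoring is functional rather than textual: a completion counts only if it passes hidden unit tests (pass@1). A hedged sketch of that check, where the prompt, candidate completion, and tests are illustrative stand-ins rather than actual benchmark items:

```python
# Sketch of a HumanEval-style functional-correctness check (pass@1):
# the model sees only a function signature and docstring (0-shot) and
# its completion passes iff the unit tests succeed. Everything below
# is an illustrative stand-in, not a real HumanEval problem.
prompt = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''
candidate_completion = "    return a + b\n"  # pretend this came from the model

namespace: dict = {}
exec(prompt + candidate_completion, namespace)  # assemble and define the function

def check(fn) -> bool:
    return fn(1, 2) == 3 and fn(-5, 5) == 0

print("pass@1:", check(namespace["add"]))  # True iff the tests pass
```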

Mathematics

Mathematical problems and computations
GSM8k
8-shot chain-of-thought (Self-reported)
86.2%
MATH
0-shot chain-of-thought (Self-reported)
48.5%
MGSM
0-shot chain-of-thought (Self-reported)
47.9%
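The 8-shot chain-of-thought setting reported for GSM8k prepends worked examples whose answers spell out intermediate reasoning before the final result. A minimal sketch of that prompt format, with two invented demonstrations standing in for the eight real ones:

```python
# Sketch of few-shot chain-of-thought formatting (GSM8k uses 8 shots):
# each demonstration includes intermediate reasoning before the answer.
# Two illustrative shots shown here instead of eight.
cot_shots = [
    ("Tom has 3 boxes with 4 apples each. How many apples does he have?",
     "Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
    ("A shirt costs $20 and is discounted 25%. What is the new price?",
     "25% of 20 is 0.25 * 20 = 5, so the price is 20 - 5 = 15. The answer is 15."),
]
question = "Sara reads 12 pages a day. How many pages does she read in a week?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in cot_shots)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)
```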

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot chain-of-thought (Self-reported)
69.0%
GPQA
0-shot chain-of-thought (Self-reported). The model reasons step by step without any worked examples, producing intermediate reasoning before its final answer; this is especially useful for mathematical and logical problems that decompose into simpler steps. A minimal sketch follows this group.
30.4%
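The 0-shot chain-of-thought setting used for BIG-Bench Hard and GPQA supplies no demonstrations at all; the classic trigger is appending a phrase like "Let's think step by step" and then extracting the final answer from the model's reasoning. A minimal illustrative sketch, with a made-up model reply standing in for a real completion:

```python
# Sketch of 0-shot chain-of-thought prompting: no worked examples, just
# an instruction that elicits step-by-step reasoning before the answer.
import re

question = ("A bat and a ball cost $1.10 together. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
prompt = f"{question}\nLet's think step by step."

# A harness would send `prompt` to the model, then pull the final answer
# out of the reasoning, e.g. from a concluding "The answer is ..." line.
fake_model_output = "... so the ball costs five cents. The answer is $0.05."
match = re.search(r"The answer is\s*(.+?)\.?$", fake_model_output)
print(match.group(1) if match else "no answer found")  # -> $0.05
```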

Other Tests

Specialized benchmarks
ARC-C
10-shot (Self-reported)
84.6%
Arena Hard
Standard evaluation (Self-reported)
37.0%
BoolQ
2-shot (Self-reported). The model first sees two worked examples of similar tasks, which establishes the expected format and solution style before it answers; a form of in-context learning.
78.0%
GovReport
Standard evaluation (Self-reported)
25.9%
MEGA MLQA
Standard evaluation (Self-reported)
61.7%
MEGA TyDi QA
Standard evaluation (Self-reported)
62.2%
MEGA UDPOS
Standard evaluation (Self-reported)
46.5%
MEGA XCOPA
Standard evaluation (Self-reported)
63.1%
MEGA XStoryCloze
Standard evaluation (Self-reported)
73.5%
MMLU-Pro
0-shot chain-of-thought (Self-reported). 0-shot CoT elicits intermediate reasoning steps without demonstrations, typically via a simple prompt such as "Let's reason step by step". Compared with plain 0-shot answering, it improves performance on reasoning-heavy tasks, especially for larger models; it is usually less effective than few-shot CoT but avoids hand-writing reasoning examples for every new task.
47.4%
MMMLU
5-shot (Self-reported). In k-shot evaluation the model is given k worked examples of the task before the new question (here k = 5), which gauges how well it can learn a task in context from a small number of demonstrations.
55.4%
OpenBookQA
10-shot (Self-reported)
79.2%
PIQA
5-shot (Self-reported)
81.0%
Qasper
Standard evaluation (Self-reported)
41.9%
QMSum
Standard evaluation (Self-reported)
21.3%
RepoQA
Average (Self-reported)
77.0%
RULER
128k (Self-reported). Evaluated at the model's full 128K-token context length; a simplified long-context probe is sketched at the end of this group.
84.1%
Social IQa
5-shot (Self-reported). Five example tasks with step-by-step solutions precede each new task, letting the model adapt to the expected answer format and solution approach without additional instructions.
74.7%
SQuALITY
Standard evaluation (Self-reported)
24.3%
SummScreenFD
Standard evaluation (Self-reported)
16.0%
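The RULER score above is reported at the model's full 128K context length; RULER-style suites probe whether a model can find and use information buried anywhere in a very long input. A simplified, illustrative needle-in-a-haystack-style probe follows; word counts here only approximate token counts, and the actual model call is left out.

```python
# Sketch of a long-context "needle in a haystack" probe: a key fact is
# buried at a chosen depth inside filler text approaching the context
# limit, and the model is asked to recall it. Illustrative only; real
# RULER tasks are more varied than this.
def build_haystack(needle: str, n_filler_words: int, depth: float) -> str:
    filler = ["lorem"] * n_filler_words
    filler.insert(int(n_filler_words * depth), needle)
    return " ".join(filler)

needle = "The secret code is 4417."
context = build_haystack(needle, n_filler_words=100_000, depth=0.5)
prompt = f"{context}\n\nWhat is the secret code mentioned above?"

# A harness would send `prompt` to the model and check that the reply
# contains "4417"; accuracy across depths and lengths yields the score.
print(len(prompt.split()), "words in context")
```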

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.