
Phi-3.5-mini-instruct

Microsoft

Phi-3.5-mini-instruct is a 3.8-billion-parameter model that supports a context window of up to 128K tokens and features improved multilingual capabilities across more than 20 languages. The model underwent additional training and safety post-training to improve instruction following, reasoning, mathematical computation, and code generation. It is well suited to memory-constrained or latency-sensitive environments and is released under the MIT license.
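For orientation, here is a minimal sketch of running the model locally with the Hugging Face transformers library. It assumes the model ID microsoft/Phi-3.5-mini-instruct as published on the Hub; the dtype and device settings are illustrative rather than prescriptive.

```python
# Minimal sketch: running Phi-3.5-mini-instruct locally with Hugging
# Face transformers. Assumes the "microsoft/Phi-3.5-mini-instruct"
# model ID on the Hub; adjust dtype/device for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 3.8B params fit on a single modern GPU
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize the Pythagorean theorem in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```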

Key Specifications

Parameters
3.8B
Context
128.0K
Release Date
August 23, 2024
Average Score
58.7%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
3.8B
Training Tokens
3.4T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
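At the listed rates, per-request cost is a simple linear function of token counts. A quick illustrative calculation (the token numbers below are made up):

```python
# Back-of-the-envelope cost estimate from the listed prices:
# $0.10 per 1M input tokens and $0.10 per 1M output tokens.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.10

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a near-maximal 120K-token prompt with a 4K-token completion.
print(f"${request_cost(120_000, 4_000):.4f}")  # -> $0.0124
```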

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
5-shot (Self-reported). The model is shown five example tasks with their solutions and then asked to solve a new task in the same format; this measures how well it can pick up a reasoning pattern from a handful of in-context examples, compared with zero-shot. A prompt-construction sketch follows this group.
69.4%
MMLU
5-shot (Self-reported). Standard 5-shot evaluation with worked examples provided in context before the test question.
69.0%
TruthfulQA
10-shot (Self-reported). Ten worked examples, typically with detailed solutions, precede the test question, giving the model a richer picture of the expected format and solution strategy than zero-shot or smaller few-shot settings.
64.0%
Winogrande
5-shot (Self-reported)
68.5%
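The k-shot settings cited above are easiest to see concretely. Below is a minimal, illustrative sketch of how a 5-shot prompt is typically assembled; the helper function and the question/answer pairs are invented for demonstration and are not drawn from any benchmark.

```python
# Sketch of k-shot prompt construction (k = 5), as used in benchmarks
# like HellaSwag and MMLU above: k solved examples are prepended to the
# test question so the model can infer the expected answer format.
# All examples below are illustrative, not benchmark items.
def build_k_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

shots = [
    ("What is the capital of France?", "Paris"),
    ("What is 7 * 8?", "56"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Hamlet'?", "William Shakespeare"),
    ("What gas do plants absorb during photosynthesis?", "Carbon dioxide"),
]
print(build_k_shot_prompt(shots, "What is the boiling point of water in Celsius?"))
```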

Programming

Programming skills tests
HumanEval
0-shot (Self-reported). The model receives only the task instructions, with no worked examples of how to complete it. A sketch of the functional-correctness scoring used by this benchmark follows this group.
62.8%
MBPP
3-shot (Self-reported)
69.6%
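HumanEval-style scoring is functional rather than textual: a completion counts only if it passes hidden unit tests (pass@1). A hedged sketch of that check, where the prompt, candidate completion, and tests are illustrative stand-ins rather than actual benchmark items:

```python
# Sketch of a HumanEval-style functional-correctness check (pass@1):
# the model sees only a function signature and docstring (0-shot) and
# its completion passes iff the unit tests succeed. Everything below
# is an illustrative stand-in, not a real HumanEval problem.
prompt = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''
candidate_completion = "    return a + b\n"  # pretend this came from the model

namespace: dict = {}
exec(prompt + candidate_completion, namespace)  # assemble and define the function

def check(fn) -> bool:
    return fn(1, 2) == 3 and fn(-5, 5) == 0

print("pass@1:", check(namespace["add"]))  # True iff the tests pass
```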

Mathematics

Mathematical problems and computations
GSM8k
8-shot chain-of-thought (Self-reported)
86.2%
MATH
0-shot chain-of-thought (Self-reported)
48.5%
MGSM
0-shot chain-of-thought (Self-reported)
47.9%
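The 8-shot chain-of-thought setting reported for GSM8k prepends worked examples whose answers spell out intermediate reasoning before the final result. A minimal sketch of that prompt format, with two invented demonstrations standing in for the eight real ones:

```python
# Sketch of few-shot chain-of-thought formatting (GSM8k uses 8 shots):
# each demonstration includes intermediate reasoning before the answer.
# Two illustrative shots shown here instead of eight.
cot_shots = [
    ("Tom has 3 boxes with 4 apples each. How many apples does he have?",
     "Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
    ("A shirt costs $20 and is discounted 25%. What is the new price?",
     "25% of 20 is 0.25 * 20 = 5, so the price is 20 - 5 = 15. The answer is 15."),
]
question = "Sara reads 12 pages a day. How many pages does she read in a week?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in cot_shots)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)
```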

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot chain-of-thought (Self-reported)
69.0%
GPQA
0-shot chain-of-thought (Self-reported). The model reasons step by step without any worked examples, producing intermediate reasoning before its final answer; this is especially useful for mathematical and logical problems that decompose into simpler steps. A minimal sketch follows this group.
30.4%
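The 0-shot chain-of-thought setting used for BIG-Bench Hard and GPQA supplies no demonstrations at all; the classic trigger is appending a phrase like "Let's think step by step" and then extracting the final answer from the model's reasoning. A minimal illustrative sketch, with a made-up model reply standing in for a real completion:

```python
# Sketch of 0-shot chain-of-thought prompting: no worked examples, just
# an instruction that elicits step-by-step reasoning before the answer.
import re

question = ("A bat and a ball cost $1.10 together. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
prompt = f"{question}\nLet's think step by step."

# A harness would send `prompt` to the model, then pull the final answer
# out of the reasoning, e.g. from a concluding "The answer is ..." line.
fake_model_output = "... so the ball costs five cents. The answer is $0.05."
match = re.search(r"The answer is\s*(.+?)\.?$", fake_model_output)
print(match.group(1) if match else "no answer found")  # -> $0.05
```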

Other Tests

Specialized benchmarks
ARC-C
10-shot (Self-reported)
84.6%
Arena Hard
Standard evaluation (Self-reported)
37.0%
BoolQ
2-shot (Self-reported). The model first sees two worked examples of similar tasks, which establishes the expected format and solution style before it answers; a form of in-context learning.
78.0%
GovReport
Standard evaluation (Self-reported)
25.9%
MEGA MLQA
Standard evaluation (Self-reported)
61.7%
MEGA TyDi QA
Standard evaluation (Self-reported)
62.2%
MEGA UDPOS
Standard evaluation (Self-reported)
46.5%
MEGA XCOPA
Standard evaluation (Self-reported)
63.1%
MEGA XStoryCloze
Standard evaluation (Self-reported)
73.5%
MMLU-Pro
0-shot chain-of-thought (Self-reported). 0-shot CoT elicits intermediate reasoning steps without demonstrations, typically via a simple prompt such as "Let's reason step by step". Compared with plain 0-shot answering, it improves performance on reasoning-heavy tasks, especially for larger models; it is usually less effective than few-shot CoT but avoids hand-writing reasoning examples for every new task.
47.4%
MMMLU
5-shot (Self-reported). In k-shot evaluation the model is given k worked examples of the task before the new question (here k = 5), which gauges how well it can learn a task in context from a small number of demonstrations.
55.4%
OpenBookQA
10-shot (Self-reported)
79.2%
PIQA
5-shot (Self-reported)
81.0%
Qasper
Standard evaluation (Self-reported)
41.9%
QMSum
Standard evaluation (Self-reported)
21.3%
RepoQA
Average (Self-reported)
77.0%
RULER
128k (Self-reported). Evaluated at the model's full 128K-token context length; a simplified long-context probe is sketched at the end of this group.
84.1%
Social IQa
5-shot (Self-reported). Five example tasks with step-by-step solutions precede each new task, letting the model adapt to the expected answer format and solution approach without additional instructions.
74.7%
SQuALITY
Standard evaluation (Self-reported)
24.3%
SummScreenFD
Standard evaluation (Self-reported)
16.0%
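The RULER score above is reported at the model's full 128K context length; RULER-style suites probe whether a model can find and use information buried anywhere in a very long input. A simplified, illustrative needle-in-a-haystack-style probe follows; word counts here only approximate token counts, and the actual model call is left out.

```python
# Sketch of a long-context "needle in a haystack" probe: a key fact is
# buried at a chosen depth inside filler text approaching the context
# limit, and the model is asked to recall it. Illustrative only; real
# RULER tasks are more varied than this.
def build_haystack(needle: str, n_filler_words: int, depth: float) -> str:
    filler = ["lorem"] * n_filler_words
    filler.insert(int(n_filler_words * depth), needle)
    return " ".join(filler)

needle = "The secret code is 4417."
context = build_haystack(needle, n_filler_words=100_000, depth=0.5)
prompt = f"{context}\n\nWhat is the secret code mentioned above?"

# A harness would send `prompt` to the model and check that the reply
# contains "4417"; accuracy across depths and lengths yields the score.
print(len(prompt.split()), "words in context")
```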

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.