Phi 4 Mini

Microsoft

Phi 4 Mini Instruct is a lightweight open model with 3.8 billion parameters built on synthetic data and filtered web data, specializing in high-quality reasoning. It supports a 128K token context window and has been refined for instruction following and safety through supervised fine-tuning and direct preference optimization.

Key Specifications

Parameters
3.8B
Context
128K tokens
Release Date
February 1, 2025
Average Score
65.4%

Timeline

Key dates in the model's history
Announcement
February 1, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
3.8B
Training Tokens
5.0T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
5-shot, Self-reported
69.1%
MMLU
5-shot, Self-reported
67.3%
TruthfulQA
MC2, 10-shot, Self-reported. TruthfulQA's MC2 metric credits the probability mass the model assigns to all true answer choices in a multiple-choice setting; here the model is shown 10 in-context examples before each query.
66.4%
Winogrande
5-shot, Self-reported. The task uses a few-shot format: the model is shown five worked examples before the new query, which demonstrates the expected output format and conditions its behavior through examples rather than explicit instructions.
67.0%
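The k-shot settings above amount to a particular way of building the prompt. A minimal sketch, assuming a plain question/answer template (the example questions below are invented for illustration, not drawn from any benchmark):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate k worked examples, then the new question with an open answer slot."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Two illustrative (hypothetical) examples; a 5-shot run would pass five.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Japan?")
```

The model then completes the trailing "A:", and the evaluation harness scores the completion against the reference answer.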

Mathematics

Mathematical problems and computations
GSM8k
8-shot, CoT, Self-reported. Eight worked examples with step-by-step solutions precede each new problem; the chain-of-thought format (phrases like "let's think step by step") prompts the model to lay out its reasoning before giving a final answer.
88.6%
MATH
0-shot, CoT, Self-reported. No in-context examples are given; the model is prompted to reason step by step before answering.
64.0%
MGSM
5-shot, Self-reported
63.9%
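The chain-of-thought settings used for GSM8k and MATH differ from plain few-shot prompting only in the answer template: each worked example shows its reasoning, and the final query ends with a reasoning cue. A sketch, assuming the common "let's think step by step" phrasing:

```python
def build_cot_prompt(question, shots=()):
    """Build a CoT prompt: optional worked examples with visible reasoning,
    then the new question ending in a step-by-step cue (0-shot if shots is empty)."""
    parts = [f"Q: {q}\nA: Let's think step by step. {reasoning}" for q, reasoning in shots]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# 0-shot CoT, as used for the MATH result above (question is illustrative).
p = build_cot_prompt("A train travels 60 km in 1.5 hours. What is its average speed?")
```

For the 8-shot GSM8k setting, `shots` would hold eight (question, worked solution) pairs instead of being empty.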

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot, CoT, Self-reported. The model solves the task without any additional examples but is encouraged to reason step by step before giving its final answer, making its intermediate thinking explicit.
70.4%
GPQA
0-shot, CoT, Self-reported. The task is posed without any worked examples (0-shot) and the model is asked to reason step by step (chain of thought). Instead of answering immediately, the model works through its reasoning before reaching a solution, relying only on its pretrained knowledge; this helps especially on complex cases.
25.2%

Other Tests

Specialized benchmarks
ARC-C
10-shot, Self-reported. Ten worked examples precede each question, showing the model the expected format and answer style before it is given the new query.
83.7%
Arena Hard
Standard evaluation, Self-reported. The model answers each query directly, without few-shot examples or special prompting, to measure its baseline ability to respond; the trade-off is that without additional instructions or constraints the model's answers may vary in quality.
32.8%
BoolQ
2-shot, Self-reported. Two worked examples precede each query, demonstrating the expected output structure without requiring explicit instructions; for some complex tasks, two examples may be too few to convey the full desired behavior.
81.2%
MMLU-Pro
0-shot, CoT, Self-reported. The model receives the task without any example solutions and is instructed to think step by step (e.g. "Let's think step by step") before giving its final answer, decomposing the problem into simpler parts. Research has shown that even without worked examples, simply prompting the model to reason sequentially can significantly improve performance on tasks requiring multi-step reasoning.
52.8%
Multilingual MMLU
5-shot, Self-reported
49.3%
OpenBookQA
10-shot, Self-reported. Ten worked examples precede each question, demonstrating correct solutions and the expected answer format before the model is asked to solve the new problem.
79.2%
PIQA
5-shot, Self-reported. The few-shot method provides the model with several examples of the completed task before it must perform a new one; in the 5-shot setting, five worked examples demonstrate the expected format and answer. This lets the model adapt to the task without additional training. Compared with zero-shot (no examples) or one-shot (a single example), five examples give the model more to generalize from, though research shows that gains from additional examples eventually plateau.
77.6%
Social IQa
5-shot, Self-reported
72.5%
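The 65.4% "Average Score" shown at the top of this page appears to be the unweighted mean of the 17 self-reported benchmark results listed above (this is an inference from the numbers, not a documented formula):

```python
# The 17 benchmark scores listed on this page, in order of appearance.
scores = [69.1, 67.3, 66.4, 67.0,        # General Knowledge
          88.6, 64.0, 63.9,              # Mathematics
          70.4, 25.2,                    # Reasoning
          83.7, 32.8, 81.2, 52.8, 49.3,  # Other Tests
          79.2, 77.6, 72.5]
average = sum(scores) / len(scores)
print(round(average, 1))  # 65.4, matching the "Average Score" above
```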

License & Metadata

License
MIT
Announcement Date
February 1, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.