Phi-4-mini-reasoning

Microsoft

Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving in memory- and compute-constrained environments and latency-sensitive scenarios. Use cases include formal proof generation, symbolic computation, complex word problems, and a wide range of mathematical reasoning scenarios. The model excels at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking.
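
A minimal sketch of running the model locally with Hugging Face Transformers follows. It assumes the hub id "microsoft/Phi-4-mini-reasoning" and the library's standard chat-template API; treat it as an illustration, not an official usage example.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model id; adjust if the hub name differs.
model_id = "microsoft/Phi-4-mini-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# A multi-step math word problem, the model's target workload.
messages = [{"role": "user", "content": "A train travels 120 km in 1.5 hours. At the same speed, how long does 200 km take?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))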

Key Specifications

Parameters
3.8B
Context
-
Release Date
April 30, 2025
Average Score
68.0%
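
The average appears to be the unweighted mean of the three self-reported benchmark scores listed under Benchmark Results below; a quick check of that assumption:

# Assumption: "Average Score" = unweighted mean of the three benchmarks below.
scores = {"GPQA Diamond": 52.0, "AIME": 57.5, "MATH-500": 94.6}
print(f"{sum(scores.values()) / len(scores):.1f}%")  # -> 68.0%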

Timeline

Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
3.8B
Training Tokens
150B
Knowledge Cutoff
February 1, 2025
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA Diamond
The Diamond subset of GPQA (Graduate-Level Google-Proof Q&A): expert-written multiple-choice science questions designed to resist simple lookup; the score is the proportion of questions answered correctly.
Self-reported
52.0%

Other Tests

Specialized benchmarks
AIME
Problems from the American Invitational Mathematics Examination, a competition-level mathematics exam; the score is the proportion of problems whose final answer is correct. Answer-only accuracy makes models easy to compare, but it reveals nothing about the model's reasoning and cannot distinguish a genuine solution from a lucky guess.
Self-reported
57.5%
MATH-500
A 500-problem subset of the MATH benchmark of competition mathematics problems; answers are scored by exact match against the reference solution.
Self-reported
94.6%
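
All three scores above are answer-accuracy metrics: the fraction of problems whose final answer matches the reference. A minimal sketch of that scoring rule follows; the normalization here is hypothetical and far simpler than what real MATH/AIME grading harnesses use.

def exact_match_accuracy(predictions, references):
    # Hypothetical, deliberately simple normalization; real harnesses
    # parse boxed answers, fractions, units, etc.
    normalize = lambda s: s.strip().lower()
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["5", "1/2"], ["5", "0.5"]))  # -> 0.5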

License & Metadata

License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025
