
Phi 4 Reasoning

Microsoft

Phi-4-reasoning is a state-of-the-art open-weight reasoning model built from Phi-4 via supervised fine-tuning on a dataset of chain-of-thought reasoning traces, followed by reinforcement learning. It targets math, science, and coding skills.

Key Specifications

Parameters
14.0B
Context
-
Release Date
April 30, 2025
Average Score
75.1%

Timeline

Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
14.0B
Training Tokens
16.0B tokens
Knowledge Cutoff
March 1, 2025
Family
-
Fine-tuned from
phi-4
Capabilities
Multimodal · ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA Diamond
Graduate-level science questions written by domain experts. Self-reported
65.8%

Other Tests

Specialized benchmarks
AIME 2024
American Invitational Mathematics Examination, 2024 problems. Self-reported
75.3%
AIME 2025
American Invitational Mathematics Examination, 2025 problems. Self-reported
62.9%
Arena Hard
Challenging user prompts with automatic judging. Self-reported
73.3%
FlenQA
Flexible-length question answering with inputs padded to roughly 3,000 tokens. Self-reported
97.7%
HumanEval+
Python code generation, scored against extended test suites. Self-reported
92.9%
IFEval
Verifiable instruction-following evaluation. Self-reported
83.4%
LiveCodeBench
Code generation on problems released between 8/1/24 and 2/1/25. Self-reported
53.8%
MMLU-Pro
Harder, reasoning-focused variant of MMLU. Self-reported
74.3%
OmniMath
Olympiad-level mathematics benchmark; answers are scored as exactly correct or incorrect, with no partial credit. Self-reported
76.6%
PhiBench
Microsoft's internal evaluation suite, version 2.21. Self-reported
70.6%
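The headline "Average Score" of 75.1% appears to be the unweighted mean of the eleven self-reported benchmark results listed above. A minimal sketch verifying that arithmetic (the dictionary below simply restates the scores from this page):

```python
# Scores as listed in the Benchmark Results section (percent).
scores = {
    "GPQA Diamond": 65.8,
    "AIME 2024": 75.3,
    "AIME 2025": 62.9,
    "Arena Hard": 73.3,
    "FlenQA": 97.7,
    "HumanEval+": 92.9,
    "IFEval": 83.4,
    "LiveCodeBench": 53.8,
    "MMLU-Pro": 74.3,
    "OmniMath": 76.6,
    "PhiBench": 70.6,
}

# Unweighted arithmetic mean across all eleven benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # → 75.1%
```

Note that an unweighted mean treats a near-saturated benchmark like FlenQA (97.7%) and a hard one like LiveCodeBench (53.8%) equally, so the single number hides large per-task differences.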

License & Metadata

License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or browse the full catalog of available AI models.
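As a rough illustration of how such a similarity ranking could work, here is a hypothetical sketch: the feature names, weights, and scoring formula below are assumptions for illustration, not the catalog's actual method.

```python
def similarity(a: dict, b: dict) -> float:
    """Return a 0..1 similarity between two model records.

    Hypothetical scheme: equal 0.25 weights on organization match,
    modality match, parameter-count ratio, and benchmark-score gap.
    """
    weights = {"org": 0.25, "multimodal": 0.25, "params": 0.25, "score": 0.25}
    s = 0.0
    s += weights["org"] * (a["org"] == b["org"])              # same developer?
    s += weights["multimodal"] * (a["multimodal"] == b["multimodal"])
    # Parameter counts compared on a ratio scale (e.g. 7B vs 14B -> 0.5).
    s += weights["params"] * (min(a["params"], b["params"]) /
                              max(a["params"], b["params"]))
    # Benchmark averages compared by absolute percentage-point gap.
    s += weights["score"] * (1 - abs(a["score"] - b["score"]) / 100)
    return s

phi4r = {"org": "Microsoft", "multimodal": False, "params": 14.0, "score": 75.1}
print(similarity(phi4r, phi4r))  # identical records → 1.0
```

A record identical to Phi-4-reasoning scores exactly 1.0; a model from another organization with half the parameters would lose the organization weight and part of the parameter weight.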