
Phi 4 Reasoning

Microsoft

Phi-4-reasoning is a state-of-the-art open-weight reasoning model built from Phi-4 via supervised fine-tuning on a dataset of chain-of-thought reasoning traces, followed by reinforcement learning. It targets math, science, and coding skills.

Key Specifications

Parameters
14.0B
Context
-
Release Date
April 30, 2025
Average Score
75.1%

Timeline

Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
14.0B
Training Tokens
16.0B tokens
Knowledge Cutoff
March 1, 2025
Family
-
Fine-tuned from
phi-4
Capabilities
Multimodal · ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA Diamond
Graduate-level science questions written by domain experts. Self-reported
65.8%

Other Tests

Specialized benchmarks
AIME 2024
American Invitational Mathematics Examination, 2024 problems. Self-reported
75.3%
AIME 2025
American Invitational Mathematics Examination, 2025 problems. Self-reported
62.9%
Arena Hard
Challenging user prompts with automatic judging. Self-reported
73.3%
FlenQA
Flexible-length question answering with inputs padded to roughly 3,000 tokens. Self-reported
97.7%
HumanEval+
Python code generation, scored against extended test suites. Self-reported
92.9%
IFEval
Verifiable instruction-following evaluation. Self-reported
83.4%
LiveCodeBench
Code generation on problems released between 8/1/24 and 2/1/25. Self-reported
53.8%
MMLU-Pro
Harder, reasoning-focused variant of MMLU. Self-reported
74.3%
OmniMath
Olympiad-level mathematics benchmark; answers are scored as exactly correct or incorrect, with no partial credit. Self-reported
76.6%
PhiBench
Microsoft's internal evaluation suite, version 2.21. Self-reported
70.6%
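The headline "Average Score" of 75.1% appears to be the unweighted mean of the eleven self-reported benchmark results listed above. A minimal sketch verifying that arithmetic (the dictionary below simply restates the scores from this page):

```python
# Scores as listed in the Benchmark Results section (percent).
scores = {
    "GPQA Diamond": 65.8,
    "AIME 2024": 75.3,
    "AIME 2025": 62.9,
    "Arena Hard": 73.3,
    "FlenQA": 97.7,
    "HumanEval+": 92.9,
    "IFEval": 83.4,
    "LiveCodeBench": 53.8,
    "MMLU-Pro": 74.3,
    "OmniMath": 76.6,
    "PhiBench": 70.6,
}

# Unweighted arithmetic mean across all eleven benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # → 75.1%
```

Note that an unweighted mean treats a near-saturated benchmark like FlenQA (97.7%) and a hard one like LiveCodeBench (53.8%) equally, so the single number hides large per-task differences.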

License & Metadata

License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or browse the full catalog of available AI models.
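As a rough illustration of how such a similarity ranking could work, here is a hypothetical sketch: the feature names, weights, and scoring formula below are assumptions for illustration, not the catalog's actual method.

```python
def similarity(a: dict, b: dict) -> float:
    """Return a 0..1 similarity between two model records.

    Hypothetical scheme: equal 0.25 weights on organization match,
    modality match, parameter-count ratio, and benchmark-score gap.
    """
    weights = {"org": 0.25, "multimodal": 0.25, "params": 0.25, "score": 0.25}
    s = 0.0
    s += weights["org"] * (a["org"] == b["org"])              # same developer?
    s += weights["multimodal"] * (a["multimodal"] == b["multimodal"])
    # Parameter counts compared on a ratio scale (e.g. 7B vs 14B -> 0.5).
    s += weights["params"] * (min(a["params"], b["params"]) /
                              max(a["params"], b["params"]))
    # Benchmark averages compared by absolute percentage-point gap.
    s += weights["score"] * (1 - abs(a["score"] - b["score"]) / 100)
    return s

phi4r = {"org": "Microsoft", "multimodal": False, "params": 14.0, "score": 75.1}
print(similarity(phi4r, phi4r))  # identical records → 1.0
```

A record identical to Phi-4-reasoning scores exactly 1.0; a model from another organization with half the parameters would lose the organization weight and part of the parameter weight.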