Phi 4 Reasoning Plus
Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model built on Phi-4 through supervised fine-tuning and reinforcement learning. It specializes in math, science, and coding. The 'plus' version gains accuracy from additional reinforcement learning, at the cost of potentially higher latency.
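For orientation, a hedged sketch of loading and querying the model with Hugging Face transformers is shown below; the checkpoint ID (microsoft/Phi-4-reasoning-plus), dtype, and generation settings are assumptions, not details taken from this page.

```python
# Minimal inference sketch, assuming the checkpoint is published as
# "microsoft/Phi-4-reasoning-plus" on Hugging Face; adjust the model ID,
# dtype, and generation settings to the official model card if they differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B parameters; bf16 keeps memory manageable
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Prove that the sum of two even integers is even."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-tuned models emit a long chain of thought before the final
# answer, so allow a generous new-token budget.
output = model.generate(input_ids, max_new_tokens=4096, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```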
Key Specifications
Parameters
14.0B
Context
-
Release Date
April 30, 2025
Average Score
78.9%
Timeline
Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
14.0B
Training Tokens
16.0B tokens
Knowledge Cutoff
March 1, 2025
Family
-
Capabilities
Multimodal • ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA
Diamond • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Standard evaluation
Self-reported
AIME 2025
Standard evaluation • Self-reported
Arena Hard
Standard evaluation
Self-reported
FlenQA
3K-token subset • Self-reported
HumanEval+
Standard evaluation • Self-reported
IFEval
Standard evaluation • Self-reported
LiveCodeBench
Problems from 8/1/24–2/1/25 • Self-reported
MMLU-Pro
Standard evaluation • Self-reported
OmniMath
Standard evaluation • Self-reported
PhiBench
Version 2.21 • Self-reported
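The "Average Score" listed under Key Specifications is presumably an unweighted mean of per-benchmark accuracies; since the individual figures are not reproduced above, the sketch below uses placeholder values purely to illustrate the aggregation.

```python
# Illustrative only: how a catalog-level "Average Score" could be derived as
# an unweighted mean of per-benchmark accuracies. The numbers below are
# placeholders, not Phi-4-reasoning-plus results.
benchmark_scores = {
    "GPQA Diamond": 0.70,  # placeholder
    "AIME 2024": 0.80,     # placeholder
    "HumanEval+": 0.90,    # placeholder
}

average_score = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Average score: {average_score:.1%}")  # prints "Average score: 80.0%"
```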
License & Metadata
License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025
Similar Models
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4
Microsoft
14.7B
Best score: 0.8 (MMLU)
Released: Dec 2024
Price: $0.07/1M tokens
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
GLM-4.7-Flash
Zhipu AI
30.0B
Best score: 0.8 (TAU)
Released: Jan 2026
Price: $0.07/1M tokens
ERNIE 4.5
Baidu
21.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
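One way to read that similarity criterion is as a weighted feature comparison; the sketch below is a hypothetical scoring function (the weights, field names, and example values are assumptions, not the catalog's actual algorithm).

```python
# Hypothetical similarity scoring over the characteristics named above.
# Weights and normalization are illustrative, not the catalog's method.
from dataclasses import dataclass

@dataclass
class ModelCard:
    developer: str
    multimodal: bool
    params_b: float    # parameter count in billions
    best_score: float  # best benchmark score, 0..1

def similarity(a: ModelCard, b: ModelCard) -> float:
    same_dev = 1.0 if a.developer == b.developer else 0.0
    same_modality = 1.0 if a.multimodal == b.multimodal else 0.0
    # Ratio of parameter counts, so 14B vs 14.7B scores close to 1.0.
    size = min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score = 1.0 - abs(a.best_score - b.best_score)
    return 0.3 * same_dev + 0.2 * same_modality + 0.3 * size + 0.2 * score

phi4_plus = ModelCard("Microsoft", False, 14.0, 0.9)
phi4 = ModelCard("Microsoft", False, 14.7, 0.8)
llama33 = ModelCard("Meta", False, 70.0, 0.9)

print(similarity(phi4_plus, phi4))     # high: same developer, similar size
print(similarity(phi4_plus, llama33))  # lower: different developer, much larger
```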