Phi 4 Reasoning Plus
Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model built on Phi-4 through supervised fine-tuning and reinforcement learning. It specializes in math, science, and coding. The 'plus' version gains accuracy from additional reinforcement learning, at the cost of potentially higher latency.
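For orientation, a hedged sketch of loading and querying the model with Hugging Face transformers is shown below; the checkpoint ID (microsoft/Phi-4-reasoning-plus), dtype, and generation settings are assumptions, not details taken from this page.

```python
# Minimal inference sketch, assuming the checkpoint is published as
# "microsoft/Phi-4-reasoning-plus" on Hugging Face; adjust the model ID,
# dtype, and generation settings to the official model card if they differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B parameters; bf16 keeps memory manageable
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Prove that the sum of two even integers is even."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-tuned models emit a long chain of thought before the final
# answer, so allow a generous new-token budget.
output = model.generate(input_ids, max_new_tokens=4096, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```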
Key Specifications
Parameters
14.0B
Context
-
Release Date
April 30, 2025
Average Score
78.9%
Timeline
Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
14.0B
Training Tokens
16.0B tokens
Knowledge Cutoff
March 1, 2025
Family
-
Capabilities
Multimodal • ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA
Diamond • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Standard evaluation
Self-reported
AIME 2025
Standard evaluation • Self-reported
Arena Hard
Standard evaluation
Self-reported
FlenQA
3K-token subset • Self-reported
HumanEval+
Standard evaluation • Self-reported
IFEval
Standard evaluation • Self-reported
LiveCodeBench
Problems from 8/1/24–2/1/25 • Self-reported
MMLU-Pro
Standard evaluation • Self-reported
OmniMath
Standard evaluation • Self-reported
PhiBench
Version 2.21 • Self-reported
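The "Average Score" listed under Key Specifications is presumably an unweighted mean of per-benchmark accuracies; since the individual figures are not reproduced above, the sketch below uses placeholder values purely to illustrate the aggregation.

```python
# Illustrative only: how a catalog-level "Average Score" could be derived as
# an unweighted mean of per-benchmark accuracies. The numbers below are
# placeholders, not Phi-4-reasoning-plus results.
benchmark_scores = {
    "GPQA Diamond": 0.70,  # placeholder
    "AIME 2024": 0.80,     # placeholder
    "HumanEval+": 0.90,    # placeholder
}

average_score = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Average score: {average_score:.1%}")  # prints "Average score: 80.0%"
```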
License & Metadata
License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025
Similar Models
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4
Microsoft
14.7B
Best score: 0.8 (MMLU)
Released: Dec 2024
Price: $0.07/1M tokens
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
GLM-4.7-Flash
Zhipu AI
30.0B
Best score: 0.8 (TAU)
Released: Jan 2026
Price: $0.07/1M tokens
ERNIE 4.5
Baidu
21.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
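One way to read that similarity criterion is as a weighted feature comparison; the sketch below is a hypothetical scoring function (the weights, field names, and example values are assumptions, not the catalog's actual algorithm).

```python
# Hypothetical similarity scoring over the characteristics named above.
# Weights and normalization are illustrative, not the catalog's method.
from dataclasses import dataclass

@dataclass
class ModelCard:
    developer: str
    multimodal: bool
    params_b: float    # parameter count in billions
    best_score: float  # best benchmark score, 0..1

def similarity(a: ModelCard, b: ModelCard) -> float:
    same_dev = 1.0 if a.developer == b.developer else 0.0
    same_modality = 1.0 if a.multimodal == b.multimodal else 0.0
    # Ratio of parameter counts, so 14B vs 14.7B scores close to 1.0.
    size = min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score = 1.0 - abs(a.best_score - b.best_score)
    return 0.3 * same_dev + 0.2 * same_modality + 0.3 * size + 0.2 * score

phi4_plus = ModelCard("Microsoft", False, 14.0, 0.9)
phi4 = ModelCard("Microsoft", False, 14.7, 0.8)
llama33 = ModelCard("Meta", False, 70.0, 0.9)

print(similarity(phi4_plus, phi4))     # high: same developer, similar size
print(similarity(phi4_plus, llama33))  # lower: different developer, much larger
```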