Key Specifications
Parameters
14.0B
Context
-
Release Date
April 30, 2025
Average Score
75.1%
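The page does not state how the headline average is derived. A minimal sketch, assuming it is the unweighted mean of the self-reported benchmark scores listed under Benchmark Results; the per-benchmark values below are placeholders, since the page omits the individual numbers:

```python
# Hypothetical reconstruction: treat the headline "Average Score" as the
# unweighted mean of the per-benchmark scores. All values here are
# illustrative placeholders; the page does not publish them.
scores = {
    "GPQA Diamond": 0.654,  # placeholder
    "AIME 2024": 0.751,     # placeholder
    "MMLU-Pro": 0.748,      # placeholder
}

average = sum(scores.values()) / len(scores)
print(f"Average Score: {average:.1%}")  # prints "Average Score: 71.8%" for these placeholders
```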
Timeline
Key dates in the model's history
Announcement
April 30, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
14.0B
Training Tokens
16.0B tokens
Knowledge Cutoff
March 1, 2025
Family
-
Fine-tuned from
phi-4
Capabilities
Multimodal, ZeroEval
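For orientation, a minimal sketch of loading this checkpoint with Hugging Face transformers. The repository id microsoft/Phi-4-reasoning and the availability of a chat template are assumptions inferred from the model name and its phi-4 base, not facts stated on this page:

```python
# Minimal sketch: load the model with Hugging Face transformers.
# The repo id below is an assumption inferred from the model name;
# verify it against the actual Hugging Face listing before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```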
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA Diamond
Standard evaluation • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Standard evaluation • Self-reported
AIME 2025
Standard evaluation • Self-reported
Arena Hard
Standard evaluation • Self-reported
FlenQA
3K-token subset • Self-reported
HumanEval+
Standard evaluation • Self-reported
IFEval
Standard evaluation • Self-reported
LiveCodeBench
Problem window: 8/1/24–2/1/25 • Self-reported
MMLU-Pro
Standard evaluation • Self-reported
OmniMath
Standard evaluation • Self-reported
PhiBench 2.21
Standard evaluation • Self-reported
License & Metadata
License
MIT
Announcement Date
April 30, 2025
Last Updated
July 19, 2025
Similar Models
Phi 4 Reasoning Plus
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4
Microsoft
14.7B
Best score: 0.8 (MMLU)
Released: Dec 2024
Price: $0.07/1M tokens
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
Hermes 3 70B
Nous Research
70.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Codestral-22B
Mistral AI
22.2B
Best score: 0.8 (HumanEval)
Released: May 2024
Price: $0.20/1M tokens
GLM-4.7-Flash
Zhipu AI
30.0B
Best score: 0.8 (TAU)
Released: Jan 2026
Price: $0.07/1M tokens
ERNIE 4.5
Baidu
21.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
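The catalog does not publish its similarity formula. A minimal sketch of one plausible scoring function over the four stated criteria; all field names, weights, and example values are illustrative assumptions:

```python
# Hypothetical similarity score over the four stated criteria:
# developer organization, multimodality, parameter size, benchmark performance.
# Equal weights and the field layout are assumptions; the page gives no formula.
from dataclasses import dataclass

@dataclass
class Model:
    developer: str
    multimodal: bool
    params_b: float    # parameters, in billions
    best_score: float  # best benchmark score, 0..1

def similarity(a: Model, b: Model) -> float:
    same_dev = 1.0 if a.developer == b.developer else 0.0
    same_modality = 1.0 if a.multimodal == b.multimodal else 0.0
    # Ratio of parameter counts: 14.0B vs 14.7B scores ~0.95.
    size = min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score_gap = 1.0 - abs(a.best_score - b.best_score)
    return (same_dev + same_modality + size + score_gap) / 4.0

phi4_reasoning = Model("Microsoft", False, 14.0, 0.90)
phi4 = Model("Microsoft", False, 14.7, 0.80)
print(f"similarity = {similarity(phi4_reasoning, phi4):.2f}")  # prints 0.96
```

Under this sketch, models from the same developer with a similar parameter count and comparable best scores (such as Phi 4 above) rank near the top, which matches the ordering shown in the list.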