Key Specifications
Parameters
14.7B
Context
16.0K
Release Date
December 12, 2024
Average Score
66.0%
Timeline
Key dates in the model's history
Announcement
December 12, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
14.7B
Training Tokens
9.8T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.07
Output (per 1M tokens)
$0.14
Max Input Tokens
16.0K
Max Output Tokens
16.0K
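At the listed rates, per-request cost is simple arithmetic. A minimal sketch (the helper function is illustrative, not part of any provider SDK):

```python
# Estimate the cost of one request from per-million-token prices.
INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.14  # USD per 1M output tokens

def request_cost(input_tokens, output_tokens):
    """Return the USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A full-context call (16K tokens in, 16K out) costs well under a cent.
print(round(request_cost(16_000, 16_000), 6))  # 0.00336
```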
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
simple-evals: a lightweight approach to evaluating the language abilities of language models and comparing models. Its principles:
- Simple setup and execution
- Tests that probe several different abilities
- Zero-shot prompting, without few-shot "data" in the assignments
- No need for complex verification of answers; the majority of answers can be checked mechanically

For this it aims to provide:
- Evaluation of abilities and comparison across models
- A picture of what a model can and cannot do

Test categories:
- **Reasoning**: tests of basic thinking and reasoning ability
- **Mathematics**: tests of mathematical ability at various levels of complexity
- **Coding**: tests of the ability to write and run code
- **Knowledge**: tests of factual knowledge
• Self-reported
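Mechanical answer checking of the kind described above can be sketched for a multiple-choice item; the question, regex, and mock model below are illustrative, not taken from any published harness:

```python
import re

# Score free-form completions against multiple-choice answers by
# extracting a final "Answer: X" letter, then computing accuracy.

def extract_answer(completion):
    m = re.search(r"Answer:\s*([A-D])", completion)
    return m.group(1) if m else None

def accuracy(questions, model):
    correct = sum(
        extract_answer(model(q["prompt"])) == q["answer"] for q in questions
    )
    return correct / len(questions)

questions = [
    {"prompt": "Capital of France? A) Berlin B) Paris C) Rome D) Madrid",
     "answer": "B"},
]
mock_model = lambda prompt: "Paris is the capital. Answer: B"
print(accuracy(questions, mock_model))  # 1.0
```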
Programming
Programming skills tests
HumanEval
simple-evals: a library for simple and automated evaluation of LLMs.

Installation:
```
pip install simple-evals
```

Example: automatic evaluation
```python
from simple_evals import AutoEvaluator

evaluator = AutoEvaluator(model_name="gpt-4-turbo-preview")
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Using a custom model:
```python
from simple_evals import AutoEvaluator, LLMRunnable

class CustomModel(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer to: " + prompt

evaluator = AutoEvaluator(model=CustomModel())
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Semi-automatic (pairwise) evaluation:
```python
from simple_evals import SemiAutoEvaluator

evaluator = SemiAutoEvaluator(
    model_a_name="gpt-4-turbo-preview",
    model_b_name="gpt-3.5-turbo",
)
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Pairwise evaluation with custom models:
```python
from simple_evals import SemiAutoEvaluator, LLMRunnable

class CustomModelA(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer from model A: " + prompt

class CustomModelB(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer from model B: " + prompt

evaluator = SemiAutoEvaluator(
    model_a=CustomModelA(),
    model_b=CustomModelB(),
)
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Exporting results to CSV:
```python
evaluator.to_csv("results.csv")
```

Exporting results to JSON:
```python
evaluator.to_json("results.json")
```
• Self-reported
Mathematics
Mathematical problems and computations
MATH
simple-evals: a system for configuring LLM evaluations and judging model behavior from model output.

Key features:
- Configuration through YAML
- Various check types, including substring and pattern checks
- Metric computation over the results
- All interactions with the model are logged for analysis

simple-evals focuses on benchmarks that can be scored mechanically yet still say something meaningful about an LLM.

Installation:
```
pip install simple-evals
```

Usage: simple-evals is driven entirely by YAML. A config determines the evaluation, including the models and the evaluation method:
```yaml
name: Test Evaluation
description: Verification of basic capabilities
version: 0.1
models:
  - name: gpt-3.5-turbo
    type: openai
  - name: gpt-4
    type: openai
evaluator:
  type: simple
metrics:
  - accuracy
```

Then you define test suites. Each suite contains examples:
```yaml
name: Math Problems
description: Elementary mathematical tasks
examples:
  - name: addition
    input: What is 2+2?
    checks:
      - type: contains
        value: "4"
  - name: multiplication
    input: What is 7*8?
    checks:
      - type: contains
        value: "56"
```

Run the evaluation:
```
simple-evals run config.yaml
```

Capabilities: simple-evals supports:
- Output verification by means of `Check` rules
- Integration with various LLM providers (OpenAI, Anthropic, local models)
- Exporting raw data for deeper analysis

Simple verification is only the beginning: you can build complex evaluations with custom checks for each example in order to probe model behavior. • Self-reported
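The `contains` check used in the suite above amounts to a substring test; a minimal sketch of such a check runner (the function names and mock model are illustrative, not the library's actual API):

```python
# Minimal interpreter for a "contains"-style check over model output.
# The example dict mirrors the YAML suite above; everything is illustrative.

def run_check(check, output):
    if check["type"] == "contains":
        return check["value"] in output
    raise ValueError(f"unknown check type: {check['type']}")

def run_example(example, model):
    output = model(example["input"])
    return all(run_check(c, output) for c in example["checks"])

example = {
    "input": "What is 7*8?",
    "checks": [{"type": "contains", "value": "56"}],
}
mock_model = lambda prompt: "7*8 = 56"
print(run_example(example, mock_model))  # True
```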
MGSM
MGSM (Multilingual Grade School Math): grade-school math word problems translated into multiple languages, testing multilingual mathematical reasoning. • Self-reported
Reasoning
Logical reasoning and analysis
DROP
simple-evals: This tool evaluates the quality of several AI models on standard reasoning tests. We use it to track progress in LLM capabilities.

What our tests cover:
- Reasoning, logic, and head-scratcher-type tasks
- Reading and computation over passages
- Comprehension of language and science
- Mathematics, including word problems
- Tasks where picking out the relevant information matters, not just recalling it

Run the tests as follows:
1. Clone this repository
2. Run `python run_evals.py --help` to see the available options
3. For example: `python run_evals.py --model gpt-3.5-turbo --dataset mmlu_stem` runs an evaluation of GPT-3.5 on MMLU STEM

You can also plug in your own model by providing a wrapper that describes its API. • Self-reported
GPQA
# simple-evals

Tools and methods for evaluating LLMs, focused on ease of use, transparency, and scale.

## Design principles
1. **Simplicity:** Use standard formats (JSON, CSV) and avoid unnecessary complexity. In most cases a task has one prompt and one answer, scored as "correct/incorrect" or numerically.
2. **Transparency:** Log everything, including instructions, prompts, and answers, so that a run can be fully reproduced without guesswork.
3. **Scale:** Run over several datasets and several LLMs at once.

## Motivation
Existing toolkits offer complex machinery for evaluation, but they try to solve too many tasks at once and complicate simple scenarios. With simple-evals we can:
- Write and run tests quickly
- Debug evaluation errors by inspecting full prompts and data end to end
- Evaluate a set of models on shared data

## Installation and usage
Install via `pip install simple-evals` and use it like this:
```python
from simple_evals.eval import evaluate_tasks

results = evaluate_tasks(
    llms={
        "gpt-4-turbo": lambda x: call_openai("gpt-4-turbo", x),
        "claude-3-opus": lambda x: call_anthropic("claude-3-opus", x),
    },
    tasks={
        "gsm8k": lambda: get_gsm8k_tasks(20),
        "mmlu": lambda: get_mmlu_tasks(["physics", "chemistry"], 20),
    },
    system_message="You are an AI assistant; answer exactly.",
)
```

## API
### Evaluating tasks
- `evaluate_tasks(llms, tasks, system_message)`: runs several models over several task sets and returns results and metrics.

### QA tasks
- `binary_qa_task(question, answer)`: a task where the model must answer "correct" or "incorrect"
- `choice_qa_task(question, choices, answer)`: a multiple-choice task where the model must pick the correct answer

### Computing metrics
- `compute • Self-reported
Other Tests
Specialized benchmarks
Arena Hard
simple-evals: a library for simple, automated evaluation of LLM performance on various tasks and benchmarks. We use it for evaluating our own models and for regression testing. It currently includes the following benchmark modules:
* matheval: arithmetic, working with numbers, etc.
* mmlueval: an MMLU-style knowledge evaluation
* codeeval: programming tasks, etc.
* langeval: language understanding and generation
* reasoningeval: reasoning tasks
• Self-reported
HumanEval+
simple-evals: a set of tools for evaluating AI models (for example, GPT-4 or Claude) across various benchmarks.

What is simple-evals? simple-evals is a toolkit that computes evaluations for large language models. We provide a way to obtain answers from a set of models through various APIs, plus a set of tests for evaluating model capabilities.

Capabilities:
- Various evaluations, including GPQA and others
- Querying several models through the OpenAI API, Claude API, etc., as well as local models
- Utilities for analyzing and visualizing results
- Hooks for creating your own evaluations

```
pip install simple-evals
```

Usage: here is how you can use simple-evals to obtain answers from models on a set of questions:
```python
import simple_evals as se

# Run all GPQA questions against GPT-4
results = se.run_evals(
    eval_set="gpqa",
    models="gpt-4",
)

# Visualize the results
se.visualize(results)
```

Evaluation sets:
- GPQA: questions for evaluating graduate-level science knowledge
- Math: questions from MATH and GSM8K
- Coding tasks: HumanEval and MBPP
- Generation and evaluation of multi-step tasks: ARC, BBH, MMLU
- and others...

Supported model backends:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Mistral and Mixtral
- Gemini
- Local models through vLLM, transformers, etc.

Extended use:
```python
# Run a set of tests on different models
results = se.run_evals(
    eval_sets=["gpqa", "gsm8k", "humaneval"],
    models=["gpt-4", "gpt-3.5-turbo", "claude-2"],
    max_samples=100,  # number of samples per eval
    cache=True,       # cache responses to save tokens
)

# Analyze the results
se.analyze
```
• Self-reported
IFEval
# simple-evals

simple-evals is a tool for evaluating and verifying LLMs that can be installed easily and adapted to various setups. It consists of a set of tools that evaluate a model across tasks.

## Key features
- Evaluate a model on diverse question-and-answer tasks with our built-in suites
- Ready-made evaluations for basic reasoning abilities, and/or tests for your own tasks
- Evaluation with the help of another model acting as the judge of the model under test
- Connect to models through a local API (vLLM) or a hosted API (OpenAI, Anthropic, etc.)

## Installation
```bash
pip install simple-evals
```

Install extras to use specific backends. For example:
```bash
pip install simple-evals[openai]       # for OpenAI models
pip install simple-evals[anthropic]    # for Anthropic models
pip install simple-evals[huggingface]  # for Hugging Face models
```

## Getting started
To get to work quickly, you can use the `simple-evals` CLI tool:
```bash
$ simple-evals run path/to/questions.jsonl --model gpt-4o
```

Or go through Python:
```python
from simple_evals import evaluator, backends, questions

# Configure the backend and load the tasks
backend = backends.OpenAIBackend(model="gpt-4o")
qs = questions.load_questions("path/to/questions.jsonl")

# Run the evaluation
results = evaluator.evaluate(qs, backend)

# Print summary information
evaluator.print_results_summary(results)
```

## Documentation
Full documentation at https://simple-evals.readthedocs.io/ • Self-reported
LiveBench
simple-evals: a framework for evaluating an LLM's instruction-following, using an LLM to judge the generations (optionally with examples). It scales from simple checks to the creation of custom tasks.

## How does it work?
1. A task defines a set of instructions and evaluation criteria
2. The candidate model generates an answer
3. A judge model (which can indeed be the same model) evaluates how well the answer matches the instructions and criteria
4. Results are reported in various formats

## Example uses
* Comparing models on specific tasks
* Iteratively improving prompts and methods
* Internal competitions

## Installation
```
pip install simple-evals
```

## Usage
```bash
# Run an evaluation
simple-evals evaluate --task-file samples/tasks/arithmetic.yaml

# Compare different prompts or models
simple-evals evaluate --task-file samples/tasks/arithmetic.yaml --comparison --runs 5
```
• Self-reported
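The judge loop described above can be sketched with deterministic stand-ins for both models; a real run would call an LLM API, and every name here is illustrative:

```python
# Judge-based evaluation: a task carries instructions and criteria,
# a candidate model answers, and a judge model scores the match.

def evaluate(task, candidate, judge):
    answer = candidate(task["instructions"])
    passed = judge(task["criteria"], answer)  # e.g. a pass/fail judgment
    return {"answer": answer, "passed": passed}

task = {
    "instructions": "Compute 12 * 12 and reply with the number only.",
    "criteria": "The reply must be exactly 144.",
}
candidate = lambda instructions: "144"
judge = lambda criteria, answer: answer.strip() == "144"
print(evaluate(task, candidate, judge))  # {'answer': '144', 'passed': True}
```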
MMLU-Pro
# Strengths, weaknesses, and limitations of attributes (SWOLA)

## Definition
The SWOLA method (Strengths, Weaknesses, and Limitations of Attributes) is an evaluation and analysis method for identifying the strong and weak sides of specific attributes, as well as their limitations. It is applied to form a structured understanding of a system.

## Methodology
SWOLA includes:
1. **Defining the attributes to evaluate**: the key properties or metrics
2. **Analyzing strengths**: the positive aspects of each attribute that confer advantages or performance
3. **Analyzing weaknesses**: shortcomings or areas requiring improvement for each attribute
4. **Evaluating limitations**: conditions or scenarios under which effectiveness or applicability can degrade
5. **Synthesizing the analysis**: combining the information to form a view of each attribute in context

## Applications
SWOLA is applied for:
- Evaluating machine learning models and their algorithms
- Comparative analysis of various approaches
- Identifying areas for research and improvement
- Decisions about adoption or deployment

## Advantages
- A structured frame for evaluation
- Allows one to obtain an understanding of capabilities and trade-offs

## Limitations
- Requires knowledge of the domain
- Can be superficial without a thorough look at specifics
- Effectiveness depends on completeness of understanding when interpreting results
• Self-reported
PhiBench
simple-evals: a library for working with and comparing large language models. It evaluates model performance based on checks coded against model output. You define tasks in a config (in YAML) with examples, and simple-evals automatically collects answers from each language model and scores them against your criteria. Results can be viewed inline or exported to CSV.

It supports various evaluation criteria:
- Picking the best answer from several models
- Scoring model answers against match criteria
- Checking whether a model's answer contains a given value
- Scoring based on computed metrics

Comparing models follows very simple instructions: define your prompt, collect answers from several models, set the comparison criteria, and review the results for decision-making. • Self-reported
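The best-of-several comparison described above can be sketched as follows; the models and the contains-style criterion are illustrative stand-ins, not the library's API:

```python
# Pairwise/best-of comparison: collect answers from several models for one
# prompt, score each against a criterion, and pick the highest-scoring model.

def compare(prompt, models, criterion):
    answers = {name: fn(prompt) for name, fn in models.items()}
    scores = {name: criterion(a) for name, a in answers.items()}
    best = max(scores, key=scores.get)
    return best, scores

models = {
    "model_a": lambda p: "The answer is 4.",
    "model_b": lambda p: "I am not sure.",
}
criterion = lambda answer: 1 if "4" in answer else 0  # contains-check
best, scores = compare("What is 2+2?", models, criterion)
print(best)  # model_a
```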
SimpleQA
simple-evals: evaluation methods for AI models; SimpleQA measures short-form factual accuracy. • Self-reported
License & Metadata
License
MIT
Announcement Date
December 12, 2024
Last Updated
July 19, 2025
Similar Models
Phi 4 Reasoning Plus
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
Llama 3.1 70B Instruct
Meta
70.0B
Best score: 0.9 (ARC)
Released: Jul 2024
Price: $0.89/1M tokens
Hermes 3 70B
Nous Research
70.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Nemotron 3 Nano (30B A3B)
NVIDIA
32.0B
Best score: 0.8 (GPQA)
Released: Dec 2025
Price: $0.06/1M tokens
Llama 3.1 Nemotron 70B Instruct
NVIDIA
70.0B
Best score: 0.8 (MMLU)
Released: Oct 2024
Codestral-22B
Mistral AI
22.2B
Best score: 0.8 (HumanEval)
Released: May 2024
Price: $0.20/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.