Phi-4-multimodal-instruct
Phi-4-multimodal-instruct is a lightweight (5.57 billion parameter) open multimodal foundation model that leverages the research and datasets behind Phi-3.5 and Phi-4.0. It processes text, image, and audio inputs to generate text outputs and supports a 128K-token context window. The model was enhanced with supervised fine-tuning (SFT), direct preference optimization (DPO), and RLHF for instruction following and safety.
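For orientation, here is a minimal, hypothetical inference sketch using the Hugging Face transformers library. The repository id (microsoft/Phi-4-multimodal-instruct), the Phi-style chat markers, the example image URL, and the generation settings are assumptions for illustration and are not taken from this page.

```python
# Hypothetical sketch: text + image inference with Hugging Face transformers.
# Assumptions (not from this page): repository id, chat markers, example URL.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Assumed Phi-style chat template with an image placeholder token.
prompt = "<|user|><|image_1|>Describe this chart in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

The model also accepts audio input per the description above; an analogous call would pass the audio waveform to the processor alongside an audio placeholder in the prompt.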
Key Specifications
Parameters
5.6B
Context
128.0K
Release Date
February 1, 2025
Average Score
72.0%
Timeline
Key dates in the model's history
Announcement
February 1, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
5.6B
Training Tokens
5.0T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal · ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
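Using the input and output prices listed above ($0.05 and $0.10 per 1M tokens), a per-request cost is simple to estimate. The helper below is a back-of-the-envelope sketch; the token counts in the example are made up for illustration.

```python
# Cost estimate from the listed prices: $0.05 per 1M input tokens,
# $0.10 per 1M output tokens. Example token counts are illustrative only.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.10  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 20K-token prompt with a 1K-token reply costs about $0.0011.
print(f"${request_cost(20_000, 1_000):.6f}")
```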
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Working with images and visual data
AI2D
Standard evaluation • Self-reported
ChartQA
Standard evaluation • Self-reported
DocVQA
Standard evaluation • Self-reported
MathVista
Standard evaluation • Self-reported
MMMU
Standard evaluation • Self-reported
Other Tests
Specialized benchmarks
BLINK
Standard evaluation • Self-reported
InfoVQA
Standard evaluation • Self-reported
InterGPS
testmini • Self-reported
MMBench
Self-reported
MMMU-Pro
std/vision • Self-reported
OCRBench
Standard evaluation • Self-reported
POPE
Standard evaluation • Self-reported
ScienceQA Visual
Self-reported
TextVQA
Standard evaluation • Self-reported
Video-MME
16 frames • Self-reported
License & Metadata
License
MIT
Announcement Date
February 1, 2025
Last Updated
July 19, 2025
Similar Models
Phi-3.5-vision-instruct
Microsoft
MM · 4.2B
Released: Aug 2024
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Aug 2024
Price: $0.10/1M tokens
Phi 4 Mini Reasoning
Microsoft
3.8B
Best score: 0.5 (GPQA)
Released: Apr 2025
Phi 4 Mini
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Feb 2025
Granite 3.3 8B Instruct
IBM
MM · 8.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Gemma 3n E4B
Google
MM · 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Granite 3.3 8B Base
IBM
MM · 8.2B
Best score: 0.9 (HumanEval)
Released: Apr 2025
DeepSeek VL2 Tiny
DeepSeek
MM · 3.0B
Released: Dec 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.