
Phi-4-multimodal-instruct

Multimodal
Microsoft

Phi-4-multimodal-instruct is a lightweight (5.6 billion parameter) open multimodal foundation model that builds on the language, vision, and speech research and datasets behind Phi-3.5 and Phi-4.0. It processes text, image, and audio inputs to generate text outputs, and supports a 128K-token context window. The model is post-trained with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) to improve instruction following and safety.
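As a minimal sketch of how the mixed text/image/audio inputs are typically addressed, the Hugging Face model card describes a chat-style prompt format with `<|user|>`, `<|assistant|>`, `<|end|>` markers and numbered media placeholders such as `<|image_1|>` and `<|audio_1|>`. The helper below assembles such a prompt string; treat the exact placeholder names as assumptions and verify them against the model card and processor for your library version.

```python
# Sketch of the single-turn chat prompt format for Phi-4-multimodal-instruct.
# Placeholder names (<|user|>, <|image_N|>, <|audio_N|>, <|end|>, <|assistant|>)
# follow the Hugging Face model card; confirm before relying on them.

def build_prompt(user_text: str, n_images: int = 0, n_audios: int = 0) -> str:
    """Assemble a prompt with optional numbered image/audio placeholders."""
    media = "".join(f"<|image_{i}|>" for i in range(1, n_images + 1))
    media += "".join(f"<|audio_{i}|>" for i in range(1, n_audios + 1))
    return f"<|user|>{media}{user_text}<|end|><|assistant|>"

prompt = build_prompt("What is shown in this chart?", n_images=1)
# The processor then pairs this string with the actual image/audio tensors.
```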

Key Specifications

Parameters
5.6B
Context
128K
Release Date
February 1, 2025
Average Score
72.0%

Timeline

Key dates in the model's history
Announcement
February 1, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
5.6B
Training Tokens
5.0T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
128K
Max Output Tokens
128K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
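The per-token prices above translate into request costs in a straightforward way: tokens divided by one million, times the listed rate. A back-of-envelope estimator (the token counts in the example are illustrative, not from the source):

```python
# Cost estimate from the listed prices: $0.05 per 1M input tokens,
# $0.10 per 1M output tokens.
INPUT_PER_M = 0.05   # USD per 1M input tokens
OUTPUT_PER_M = 0.10  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. a hypothetical request with 10,000 input and 2,000 output tokens:
cost = request_cost(10_000, 2_000)  # 0.0005 + 0.0002 = 0.0007 USD
```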

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
Standard evaluation (Self-reported)
82.3%
ChartQA
Standard evaluation (Self-reported)
81.4%
DocVQA
Standard evaluation (Self-reported)
93.2%
MathVista
Standard evaluation (Self-reported)
62.4%
MMMU
Standard evaluation (Self-reported)
55.1%

Other Tests

Specialized benchmarks
BLINK
Standard evaluation (Self-reported)
61.3%
InfoVQA
Standard evaluation (Self-reported)
72.7%
InterGPS
testmini (Self-reported)
48.6%
MMBench
Self-reported
86.7%
MMMU-Pro
std/vision (Self-reported)
38.5%
OCRBench
Standard evaluation (Self-reported)
84.4%
POPE
Standard evaluation (Self-reported)
85.6%
ScienceQA Visual
Self-reported
97.5%
TextVQA
Standard evaluation (Self-reported)
75.6%
Video-MME
16 frames (Self-reported)
55.0%
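The 72.0% "Average Score" in the key specifications appears to be the unweighted mean of the fifteen benchmark results listed above; this aggregation method is an assumption, since the catalog does not state how the average is computed. A quick check:

```python
# Unweighted mean of the listed benchmark scores (assumed aggregation).
scores = [
    82.3, 81.4, 93.2, 62.4, 55.1,   # AI2D, ChartQA, DocVQA, MathVista, MMMU
    61.3, 72.7, 48.6, 86.7, 38.5,   # BLINK, InfoVQA, InterGPS, MMBench, MMMU-Pro
    84.4, 85.6, 97.5, 75.6, 55.0,   # OCRBench, POPE, ScienceQA Visual, TextVQA, Video-MME
]
average = sum(scores) / len(scores)
print(round(average, 1))  # 72.0 -- matches the listed Average Score
```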

License & Metadata

License
MIT
Announcement Date
February 1, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.