Phi-4-multimodal-instruct
Phi-4-multimodal-instruct is a lightweight (5.57 billion parameter) open multimodal foundation model that leverages the research and datasets behind Phi-3.5 and Phi-4.0. It processes text, image, and audio inputs to generate text outputs and supports a 128K-token context window. The model was enhanced with supervised fine-tuning (SFT), direct preference optimization (DPO), and RLHF for instruction following and safety.
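For orientation, here is a minimal, hypothetical inference sketch using the Hugging Face transformers library. The repository id (microsoft/Phi-4-multimodal-instruct), the Phi-style chat markers, the example image URL, and the generation settings are assumptions for illustration and are not taken from this page.

```python
# Hypothetical sketch: text + image inference with Hugging Face transformers.
# Assumptions (not from this page): repository id, chat markers, example URL.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Assumed Phi-style chat template with an image placeholder token.
prompt = "<|user|><|image_1|>Describe this chart in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

The model also accepts audio input per the description above; an analogous call would pass the audio waveform to the processor alongside an audio placeholder in the prompt.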
Key Specifications
Parameters
5.6B
Context
128.0K
Release Date
February 1, 2025
Average Score
72.0%
Timeline
Key dates in the model's history
Announcement
February 1, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
5.6B
Training Tokens
5.0T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal · ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
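Using the input and output prices listed above ($0.05 and $0.10 per 1M tokens), a per-request cost is simple to estimate. The helper below is a back-of-the-envelope sketch; the token counts in the example are made up for illustration.

```python
# Cost estimate from the listed prices: $0.05 per 1M input tokens,
# $0.10 per 1M output tokens. Example token counts are illustrative only.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.10  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 20K-token prompt with a 1K-token reply costs about $0.0011.
print(f"${request_cost(20_000, 1_000):.6f}")
```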
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Working with images and visual data
AI2D
Standard evaluation • Self-reported
ChartQA
Standard evaluation • Self-reported
DocVQA
Standard evaluation • Self-reported
MathVista
Standard evaluation • Self-reported
MMMU
Standard evaluation • Self-reported
Other Tests
Specialized benchmarks
BLINK
Standard evaluation • Self-reported
InfoVQA
Standard evaluation • Self-reported
InterGPS
testmini • Self-reported
MMBench
Self-reported
MMMU-Pro
std/vision • Self-reported
OCRBench
Standard evaluation • Self-reported
POPE
Standard evaluation • Self-reported
ScienceQA Visual
Self-reported
TextVQA
Standard evaluation • Self-reported
Video-MME
16 frames • Self-reported
License & Metadata
License
MIT
Announcement Date
February 1, 2025
Last Updated
July 19, 2025
Similar Models
Phi-3.5-vision-instruct
Microsoft
MM · 4.2B
Released: Aug 2024
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Aug 2024
Price: $0.10/1M tokens
Phi 4 Mini Reasoning
Microsoft
3.8B
Best score: 0.5 (GPQA)
Released: Apr 2025
Phi 4 Mini
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Feb 2025
Granite 3.3 8B Instruct
IBM
MM · 8.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Gemma 3n E4B
Google
MM · 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Granite 3.3 8B Base
IBM
MM · 8.2B
Best score: 0.9 (HumanEval)
Released: Apr 2025
DeepSeek VL2 Tiny
DeepSeek
MM · 3.0B
Released: Dec 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.