
Phi-3.5-vision-instruct

Multimodal
Microsoft

Phi-3.5-vision-instruct is an open multimodal model with 4.2 billion parameters and a context window of up to 128K tokens. It specializes in understanding and analyzing multiple image frames, improving performance on single-image benchmarks while also enabling multi-image comparison, summarization, and even video analysis. The model was post-trained for safety to improve instruction following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.

Key Specifications

Parameters
4.2B
Context
128K tokens
Release Date
August 23, 2024
Average Score
68.3%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
4.2B
Training Tokens
500.0B tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
Standard evaluation, self-reported
78.1%
ChartQA
Standard evaluation, self-reported
81.8%
MathVista
Standard evaluation, self-reported
43.9%
MMMU
Standard evaluation, self-reported
43.0%

Other Tests

Specialized benchmarks
InterGPS
Standard evaluation, self-reported
36.3%
MMBench
Standard evaluation, self-reported
81.9%
POPE
Standard evaluation, self-reported
86.1%
ScienceQA
Standard evaluation, self-reported
91.3%
TextVQA
Standard evaluation, self-reported
72.0%
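The Average Score shown above (68.3%) appears to be the unweighted mean of the nine self-reported benchmark results listed here; a quick sanity check (the dictionary below simply restates the scores from this page):

```python
# Self-reported benchmark scores for Phi-3.5-vision-instruct, as listed above.
scores = {
    "AI2D": 78.1, "ChartQA": 81.8, "MathVista": 43.9, "MMMU": 43.0,
    "InterGPS": 36.3, "MMBench": 81.9, "POPE": 86.1,
    "ScienceQA": 91.3, "TextVQA": 72.0,
}

# Unweighted mean across all nine benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # 68.3%
```

The mean comes out to 68.27%, which rounds to the 68.3% reported as the Average Score.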

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.