
Phi-3.5-vision-instruct

Multimodal
Microsoft

Phi-3.5-vision-instruct is an open multimodal model with 4.2 billion parameters and a context window of up to 128K tokens. It specializes in understanding and analyzing multiple image frames, improving performance on single-image benchmarks while also enabling multi-image comparison, summarization, and even video analysis. The model was post-trained for safety to improve instruction following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.

Key Specifications

Parameters
4.2B
Context
128K tokens
Release Date
August 23, 2024
Average Score
68.3%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
4.2B
Training Tokens
500.0B tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
Standard evaluation, self-reported
78.1%
ChartQA
Standard evaluation, self-reported
81.8%
MathVista
Standard evaluation, self-reported
43.9%
MMMU
Standard evaluation, self-reported
43.0%

Other Tests

Specialized benchmarks
InterGPS
Standard evaluation, self-reported
36.3%
MMBench
Standard evaluation, self-reported
81.9%
POPE
Standard evaluation, self-reported
86.1%
ScienceQA
Standard evaluation, self-reported
91.3%
TextVQA
Standard evaluation, self-reported
72.0%
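The Average Score shown above (68.3%) appears to be the unweighted mean of the nine self-reported benchmark results listed here; a quick sanity check (the dictionary below simply restates the scores from this page):

```python
# Self-reported benchmark scores for Phi-3.5-vision-instruct, as listed above.
scores = {
    "AI2D": 78.1, "ChartQA": 81.8, "MathVista": 43.9, "MMMU": 43.0,
    "InterGPS": 36.3, "MMBench": 81.9, "POPE": 86.1,
    "ScienceQA": 91.3, "TextVQA": 72.0,
}

# Unweighted mean across all nine benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # 68.3%
```

The mean comes out to 68.27%, which rounds to the 68.3% reported as the Average Score.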

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.