Phi-3.5-vision-instruct
Phi-3.5-vision-instruct is an open multimodal model with 4.2 billion parameters and a context window of up to 128K tokens. The model specializes in understanding and analyzing multiple image frames: it improves performance on single-image benchmarks while enabling multi-image comparison, summarization, and even video analysis. It underwent safety-focused post-training to improve instruction following, alignment, and robust handling of visual and text inputs, and is released under the MIT license.
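The multi-image analysis described above is typically driven through numbered image placeholders in the prompt. The sketch below assumes the `<|image_N|>` placeholder convention from the model's Hugging Face usage example; `build_prompt` is a hypothetical helper, not part of any official API.

```python
# Sketch (assumption): assemble a multi-image prompt for Phi-3.5-vision-instruct.
# Each attached image is referenced by a numbered <|image_N|> placeholder,
# followed by the user's question; the resulting string would be passed to the
# model's processor together with the list of images.
def build_prompt(num_images: int, question: str) -> str:
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return placeholders + question

# Example: ask the model to compare two images.
prompt = build_prompt(2, "Summarize the key differences between these two slides.")
print(prompt)
```

With two images, this produces `<|image_1|>` and `<|image_2|>` placeholders on separate lines, followed by the question, matching the one-placeholder-per-frame pattern used for multi-frame and video-style inputs.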
Key Specifications
Timeline
Technical Specifications
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Other Tests
License & Metadata
Similar Models
Phi-4-multimodal-instruct
Microsoft
Phi-3.5-mini-instruct
Microsoft
Phi 4 Mini Reasoning
Microsoft
Phi 4 Mini
Microsoft
Gemma 3n E2B Instructed LiteRT (Preview)
Google
Gemma 3n E4B Instructed
Google
Gemma 3n E2B Instructed
Google
Granite 3.3 8B Instruct
IBM
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.