Phi-3.5-MoE-instruct

Microsoft

Phi-3.5-MoE-instruct is a Mixture-of-Experts model with approximately 42 billion total parameters (6.6 billion active per token) and a 128K-token context window. It excels at reasoning, math, coding, and multilingual tasks, outperforming larger dense models on many benchmarks. The model went through a thorough safety-focused post-training process (SFT followed by DPO) and is released under the MIT license. It suits scenarios that need both efficiency and high performance, especially multilingual or reasoning-intensive workloads.
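A minimal usage sketch, assuming the Hugging Face transformers library and the public microsoft/Phi-3.5-MoE-instruct checkpoint; the exact arguments (dtype, device mapping, trust_remote_code) are illustrative choices and may vary with your transformers version and hardware:

    # Hedged sketch: loads the public checkpoint with the standard transformers
    # API; argument choices below are illustrative, not canonical.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3.5-MoE-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # full MoE weights are large; bf16 halves memory
        device_map="auto",           # spread layers/experts across available GPUs
        trust_remote_code=True,      # assumption: the MoE architecture may ship custom code
    )

    messages = [{"role": "user", "content": "Summarize mixture-of-experts routing in two sentences."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))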

Key Specifications

Parameters
42B (6.6B active)
Context
128K tokens
Release Date
August 23, 2024
Average Score
65.6%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
42B total (6.6B active; see the sketch after this block)
Training Tokens
4.9T tokens
Knowledge Cutoff
-
Family
Phi-3.5
Capabilities
Multimodal, ZeroEval
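The gap between the 42B total and 6.6B active parameters comes from sparse routing: each token is processed by only 2 of the 16 experts in each MoE layer. A back-of-the-envelope sketch in Python; the expert count and top-2 routing match the published Phi-3.5-MoE description, while the shared/expert parameter split below is an illustrative assumption, not an official figure:

    # Illustrative arithmetic only: the shared/expert split is assumed, not official.
    n_experts = 16         # experts per MoE layer (published)
    top_k = 2              # experts activated per token (published)
    shared = 1.3e9         # assumed non-expert parameters (embeddings, attention, routers)
    experts = 40.6e9       # assumed parameters held inside the 16 experts

    total = shared + experts
    active = shared + experts * (top_k / n_experts)
    print(f"total = {total/1e9:.1f}B, active per token = {active/1e9:.1f}B")
    # total = 41.9B, active per token = 6.4B -- in line with the reported 42B / 6.6B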

Benchmark Results

Model performance metrics across various tests and benchmarks
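The shot counts quoted below (5-shot, 10-shot, and so on) refer to how many solved examples are placed in the prompt before the test question. A minimal, hypothetical sketch of how such a prompt is assembled; the demonstration pairs are invented for illustration and are not drawn from any benchmark:

    # Hypothetical n-shot prompt builder; the demo pairs are invented examples.
    def build_few_shot_prompt(demos, question):
        """Prepend solved (question, answer) pairs so the model infers the task format."""
        parts = [f"Q: {q}\nA: {a}" for q, a in demos]
        parts.append(f"Q: {question}\nA:")
        return "\n\n".join(parts)

    demos = [
        ("What is the capital of France?", "Paris"),
        ("What is 7 * 8?", "56"),
    ]
    print(build_few_shot_prompt(demos, "What is the boiling point of water in Celsius?"))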

General Knowledge

Tests on general knowledge and understanding
HellaSwag
5-shot (self-reported)
83.8%
MMLU
5-shot (self-reported)
78.9%
TruthfulQA
10-shot (self-reported)
77.5%
Winogrande
5-shot (self-reported)
81.3%

Programming

Programming skills tests
HumanEval
0-shot (self-reported)
70.7%
MBPP
3-shot (self-reported)
80.8%

Mathematics

Mathematical problems and computations
GSM8k
8-shot chain-of-thought (self-reported)
88.7%
MATH
0-shot chain-of-thought (self-reported)
59.5%
MGSM
0-shot chain-of-thought (self-reported; see the prompt sketch after this block)
58.7%
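
The 0-shot chain-of-thought setting used for MATH and MGSM above gives the model no worked examples; a trigger phrase instead asks it to reason step by step before answering, following Kojima et al. (2022). A hedged sketch; the answer-extraction rule is a simplifying assumption, not the benchmark's official scoring procedure:

    # Sketch of a zero-shot CoT prompt; the extraction rule is an assumed convention.
    import re

    def zero_shot_cot_prompt(problem):
        # No demonstrations: just the problem plus a step-by-step trigger phrase.
        return f"{problem}\nLet's think step by step."

    def extract_final_answer(completion):
        # Assumption: the last number in the completion is taken as the answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return numbers[-1] if numbers else None

    print(zero_shot_cot_prompt("If 3 apples cost $1.50, how much do 7 apples cost?"))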

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot chain-of-thought (self-reported)
79.1%
GPQA
0-shot chain-of-thought (self-reported)
36.8%

Other Tests

Specialized benchmarks
ARC-C
10-shot (self-reported)
91.0%
Arena Hard
standard evaluation (self-reported)
37.9%
BoolQ
2-shot (self-reported)
84.6%
GovReport
standard evaluation (self-reported)
26.4%
MEGA MLQA
standard evaluation (self-reported)
65.3%
MEGA TyDi QA
standard evaluation (self-reported)
67.1%
MEGA UDPOS
standard evaluation (self-reported)
60.4%
MEGA XCOPA
standard evaluation (self-reported)
76.6%
MEGA XStoryCloze
standard evaluation (self-reported)
82.8%
MMLU-Pro
standard evaluation (self-reported)
45.3%
MMMLU
5-shot (self-reported)
69.9%
OpenBookQA
10-shot (self-reported)
89.6%
PIQA
5-shot (self-reported)
88.6%
Qasper
standard evaluation (self-reported)
40.0%
QMSum
standard evaluation (self-reported)
19.9%
RepoQA
Average (self-reported)
85.0%
RULER
Long-context evaluation at 128K context (self-reported)
87.1%
Social IQa
5-shot (self-reported)
78.0%
SQuALITY
standard evaluation (self-reported)
24.1%
SummScreenFD
standard evaluation (self-reported)
16.9%

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025
