Phi-3.5-MoE-instruct

Microsoft

Phi-3.5-MoE-instruct is a Mixture-of-Experts model with approximately 42 billion total parameters (6.6 billion active per token) and a 128K-token context window. It excels at reasoning, math, coding, and multilingual tasks, outperforming larger dense models on many benchmarks. The model went through a thorough safety-focused post-training process (SFT followed by DPO) and is released under the MIT license. It suits scenarios that need both efficiency and high performance, especially multilingual or reasoning-intensive workloads.
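A minimal usage sketch, assuming the Hugging Face transformers library and the public microsoft/Phi-3.5-MoE-instruct checkpoint; the exact arguments (dtype, device mapping, trust_remote_code) are illustrative choices and may vary with your transformers version and hardware:

    # Hedged sketch: loads the public checkpoint with the standard transformers
    # API; argument choices below are illustrative, not canonical.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3.5-MoE-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # full MoE weights are large; bf16 halves memory
        device_map="auto",           # spread layers/experts across available GPUs
        trust_remote_code=True,      # assumption: the MoE architecture may ship custom code
    )

    messages = [{"role": "user", "content": "Summarize mixture-of-experts routing in two sentences."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))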

Key Specifications

Parameters
42B (6.6B active)
Context
128K tokens
Release Date
August 23, 2024
Average Score
65.6%

Timeline

Key dates in the model's history
Announcement
August 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
42B total (6.6B active; see the sketch after this block)
Training Tokens
4.9T tokens
Knowledge Cutoff
-
Family
Phi-3.5
Capabilities
Multimodal, ZeroEval
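The gap between the 42B total and 6.6B active parameters comes from sparse routing: each token is processed by only 2 of the 16 experts in each MoE layer. A back-of-the-envelope sketch in Python; the expert count and top-2 routing match the published Phi-3.5-MoE description, while the shared/expert parameter split below is an illustrative assumption, not an official figure:

    # Illustrative arithmetic only: the shared/expert split is assumed, not official.
    n_experts = 16         # experts per MoE layer (published)
    top_k = 2              # experts activated per token (published)
    shared = 1.3e9         # assumed non-expert parameters (embeddings, attention, routers)
    experts = 40.6e9       # assumed parameters held inside the 16 experts

    total = shared + experts
    active = shared + experts * (top_k / n_experts)
    print(f"total = {total/1e9:.1f}B, active per token = {active/1e9:.1f}B")
    # total = 41.9B, active per token = 6.4B -- in line with the reported 42B / 6.6B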

Benchmark Results

Model performance metrics across various tests and benchmarks
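The shot counts quoted below (5-shot, 10-shot, and so on) refer to how many solved examples are placed in the prompt before the test question. A minimal, hypothetical sketch of how such a prompt is assembled; the demonstration pairs are invented for illustration and are not drawn from any benchmark:

    # Hypothetical n-shot prompt builder; the demo pairs are invented examples.
    def build_few_shot_prompt(demos, question):
        """Prepend solved (question, answer) pairs so the model infers the task format."""
        parts = [f"Q: {q}\nA: {a}" for q, a in demos]
        parts.append(f"Q: {question}\nA:")
        return "\n\n".join(parts)

    demos = [
        ("What is the capital of France?", "Paris"),
        ("What is 7 * 8?", "56"),
    ]
    print(build_few_shot_prompt(demos, "What is the boiling point of water in Celsius?"))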

General Knowledge

Tests on general knowledge and understanding
HellaSwag
5-shot (self-reported)
83.8%
MMLU
5-shot (self-reported)
78.9%
TruthfulQA
10-shot (self-reported)
77.5%
Winogrande
5-shot (self-reported)
81.3%

Programming

Programming skills tests
HumanEval
0-shot (self-reported)
70.7%
MBPP
3-shot (self-reported)
80.8%

Mathematics

Mathematical problems and computations
GSM8k
8-shot chain-of-thought (self-reported)
88.7%
MATH
0-shot chain-of-thought (self-reported)
59.5%
MGSM
0-shot chain-of-thought (self-reported; see the prompt sketch after this block)
58.7%
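
The 0-shot chain-of-thought setting used for MATH and MGSM above gives the model no worked examples; a trigger phrase instead asks it to reason step by step before answering, following Kojima et al. (2022). A hedged sketch; the answer-extraction rule is a simplifying assumption, not the benchmark's official scoring procedure:

    # Sketch of a zero-shot CoT prompt; the extraction rule is an assumed convention.
    import re

    def zero_shot_cot_prompt(problem):
        # No demonstrations: just the problem plus a step-by-step trigger phrase.
        return f"{problem}\nLet's think step by step."

    def extract_final_answer(completion):
        # Assumption: the last number in the completion is taken as the answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return numbers[-1] if numbers else None

    print(zero_shot_cot_prompt("If 3 apples cost $1.50, how much do 7 apples cost?"))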

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot chain-of-thought (self-reported)
79.1%
GPQA
0-shot chain-of-thought (self-reported)
36.8%

Other Tests

Specialized benchmarks
ARC-C
10-shot (self-reported)
91.0%
Arena Hard
standard evaluation (self-reported)
37.9%
BoolQ
2-shot (self-reported)
84.6%
GovReport
standard evaluation (self-reported)
26.4%
MEGA MLQA
standard evaluation (self-reported)
65.3%
MEGA TyDi QA
standard evaluation (self-reported)
67.1%
MEGA UDPOS
standard evaluation (self-reported)
60.4%
MEGA XCOPA
standard evaluation (self-reported)
76.6%
MEGA XStoryCloze
standard evaluation (self-reported)
82.8%
MMLU-Pro
standard evaluation (self-reported)
45.3%
MMMLU
5-shot (self-reported)
69.9%
OpenBookQA
10-shot (self-reported)
89.6%
PIQA
5-shot (self-reported)
88.6%
Qasper
standard evaluation (self-reported)
40.0%
QMSum
standard evaluation (self-reported)
19.9%
RepoQA
Average (self-reported)
85.0%
RULER
Long-context evaluation at 128K context (self-reported)
87.1%
Social IQa
5-shot (self-reported)
78.0%
SQuALITY
standard evaluation (self-reported)
24.1%
SummScreenFD
standard evaluation (self-reported)
16.9%

License & Metadata

License
MIT
Announcement Date
August 23, 2024
Last Updated
July 19, 2025
