Kimi K2 Instruct
Kimi K2 Instruct is the instruction-tuned version of Kimi K2, a Mixture-of-Experts (MoE) language model from Moonshot AI with 1 trillion total parameters, 32 billion of which are active per forward pass. It is optimized for instruction following, multi-turn conversation, and agentic use cases, supports context windows up to 128K tokens, and performs strongly on coding, reasoning, and tool-use tasks.
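As a quick illustration of what the sparse-activation figures imply, only a small fraction of the total parameters is exercised per token. A minimal sketch, using only the totals stated above:

```python
# Back-of-the-envelope: fraction of parameters used per token under the
# stated MoE configuration (1T total, 32B active, from the spec above).
total_params = 1.0e12
active_params = 32e9

active_fraction = active_params / total_params
print(f"Active per forward pass: {active_fraction:.1%}")  # → 3.2%
```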
Key Specifications
Parameters
1.0T
Context
131.1K
Release Date
July 11, 2025
Average Score
66.7%
Timeline
Key dates in the model's history
Announcement
July 11, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
1.0T
Training Tokens
15.5T tokens
Knowledge Cutoff
-
Family
-
Fine-tuned from
kimi-k2-base
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.57
Output (per 1M tokens)
$2.30
Max Input Tokens
131.1K
Max Output Tokens
131.1K
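Given the per-1M-token rates listed above, per-request cost is a simple linear sum. A minimal estimator sketch (rates are hardcoded from this page and may change; check the provider for current pricing):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate: float = 0.57,    # $/1M input tokens, from this page
                      output_rate: float = 2.30    # $/1M output tokens, from this page
                      ) -> float:
    """Estimate request cost in USD from per-1M-token rates."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# e.g. a 10K-token prompt with a 2K-token completion:
print(f"${estimate_cost_usd(10_000, 2_000):.4f}")  # → $0.0103
```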
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
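Function calling is typically exercised through an OpenAI-compatible "tools" schema. A hypothetical tool definition might look like the following; the tool name, fields, and model id here are illustrative assumptions, not taken from this page:

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format
# commonly used for function calling. The tool name and model id are
# illustrative; consult the provider's API docs for the exact schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "kimi-k2-instruct",  # illustrative model id
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [weather_tool],
}
```

The model would respond with a structured tool call (name plus JSON arguments) that the client executes before returning the result in a follow-up message.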
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard evaluation • Self-reported
Programming
Programming skills tests
HumanEval
Pass@1 • Self-reported
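Pass@1 is the fraction of problems solved by a single sample. When n samples are drawn per problem, the standard unbiased pass@k estimator (popularized by the HumanEval evaluation setup) generalizes this; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples and 10 correct, pass@1 reduces to c/n:
print(pass_at_k(20, 10, 1))  # → 0.5
```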
Mathematics
Mathematical problems and computations
GSM8k
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond, Avg@8 • Self-reported
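Avg@k (used for GPQA Diamond, AIME, HMMT, and Tau2 below) is simply the mean score over k independent sampled attempts, smoothing out sampling variance. A trivial sketch:

```python
from statistics import mean

def avg_at_k(scores: list[float]) -> float:
    """Avg@k: mean score over k independent sampled attempts."""
    return mean(scores)

# e.g. Avg@8: eight attempts on the same benchmark, scored 0/1 each:
print(avg_at_k([1, 1, 0, 1, 1, 0, 1, 1]))  # → 0.75
```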
Other Tests
Specialized benchmarks
AceBench
Accuracy • Self-reported
Aider-Polyglot
Accuracy • Self-reported
AIME 2024
Avg@64 • Self-reported
AIME 2025
Avg@64 • Self-reported
AutoLogi
Accuracy • Self-reported
CBNSL
Accuracy • Self-reported
CNMO 2024
Avg@16 • Self-reported
CSimpleQA
Correct • Self-reported
HMMT 2025
Avg@32 • Self-reported
HumanEval-ER
Pass@1 • Self-reported
Humanity's Last Exam
Accuracy • Self-reported
IFEval
Standard evaluation • Self-reported
LiveBench
Pass@1 (n=100) • Self-reported
LiveCodeBench v6
Pass@1 • Self-reported
MATH-500
Accuracy • Self-reported
MMLU-Pro
EM • Self-reported
MMLU-Redux
EM • Self-reported
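The "EM" metric on MMLU-Pro and MMLU-Redux is exact match after light answer normalization. A minimal sketch of such a scorer; the exact normalization rules (casing, punctuation, whitespace) vary between evaluation harnesses:

```python
import string

def exact_match(pred: str, gold: str) -> bool:
    """Exact match (EM) after lowercasing, stripping punctuation, and
    collapsing whitespace -- one common normalization; harnesses differ."""
    def norm(s: str) -> str:
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())
    return norm(pred) == norm(gold)

print(exact_match(" (B) ", "b"))  # → True under this normalization
```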
MultiChallenge
Accuracy • Self-reported
MultiPL-E
Pass@1 • Self-reported
MuSR
Pass@1 • Self-reported
OJBench
Pass@1 • Self-reported
PolyMath-en
Avg@4 • Self-reported
SimpleQA
Standard evaluation • Self-reported
SuperGPQA
Accuracy • Self-reported
SWE-bench Multilingual
Standard evaluation • Self-reported
SWE-bench Verified (Agentic Coding)
Standard evaluation • Self-reported
SWE-bench Verified (Agentless)
Standard evaluation • Self-reported
SWE-bench Verified (Multiple Attempts)
Standard evaluation • Self-reported
Tau2 airline
Avg@4 • Self-reported
Tau2 retail
Avg@4 • Self-reported
Tau2 telecom
Avg@4 • Self-reported
Terminal-bench
Standard evaluation • Self-reported
Terminus
Accuracy • Self-reported
ZebraLogic
Accuracy • Self-reported
License & Metadata
License
Modified MIT License
Announcement Date
July 11, 2025
Last Updated
July 19, 2025
Similar Models
Kimi K2 0905
Moonshot AI
1.0T
Best score: 0.9 (HumanEval)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2-Instruct-0905
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2-Thinking-0905
Moonshot AI
1.0T
Best score: 0.8 (GPQA)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2 Base
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Jul 2025
MiMo-V2-Flash
Xiaomi
309.0B
Best score: 0.8 (GPQA)
Released: Dec 2025
Command R+
Cohere
104.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Price: $0.25/1M tokens
GLM-4.7
Zhipu AI
358.0B
Best score: 0.9 (TAU)
Released: Dec 2025
Price: $0.60/1M tokens
LongCat-Flash-Chat
Meituan
560.0B
Best score: 0.9 (MMLU)
Released: Aug 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.