Kimi K2 Instruct
Kimi K2 Instruct is the instruction-tuned version of Kimi K2, a Mixture-of-Experts (MoE) language model from Moonshot AI with 1 trillion total parameters, 32 billion of which are active per forward pass. It is optimized for instruction following, multi-turn conversation, and agentic use cases, supports context windows up to 128K tokens, and performs strongly on coding, reasoning, and tool-use tasks.
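As a quick illustration of what the sparse-activation figures imply, only a small fraction of the total parameters is exercised per token. A minimal sketch, using only the totals stated above:

```python
# Back-of-the-envelope: fraction of parameters used per token under the
# stated MoE configuration (1T total, 32B active, from the spec above).
total_params = 1.0e12
active_params = 32e9

active_fraction = active_params / total_params
print(f"Active per forward pass: {active_fraction:.1%}")  # → 3.2%
```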
Key Specifications
Parameters
1.0T
Context
131.1K
Release Date
July 11, 2025
Average Score
66.7%
Timeline
Key dates in the model's history
Announcement
July 11, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
1.0T
Training Tokens
15.5T tokens
Knowledge Cutoff
-
Family
-
Fine-tuned from
kimi-k2-base
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.57
Output (per 1M tokens)
$2.30
Max Input Tokens
131.1K
Max Output Tokens
131.1K
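Given the per-1M-token rates listed above, per-request cost is a simple linear sum. A minimal estimator sketch (rates are hardcoded from this page and may change; check the provider for current pricing):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate: float = 0.57,    # $/1M input tokens, from this page
                      output_rate: float = 2.30    # $/1M output tokens, from this page
                      ) -> float:
    """Estimate request cost in USD from per-1M-token rates."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# e.g. a 10K-token prompt with a 2K-token completion:
print(f"${estimate_cost_usd(10_000, 2_000):.4f}")  # → $0.0103
```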
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
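Function calling is typically exercised through an OpenAI-compatible "tools" schema. A hypothetical tool definition might look like the following; the tool name, fields, and model id here are illustrative assumptions, not taken from this page:

```python
# Hypothetical tool definition in the OpenAI-compatible "tools" format
# commonly used for function calling. The tool name and model id are
# illustrative; consult the provider's API docs for the exact schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "kimi-k2-instruct",  # illustrative model id
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [weather_tool],
}
```

The model would respond with a structured tool call (name plus JSON arguments) that the client executes before returning the result in a follow-up message.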
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Standard evaluation • Self-reported
Programming
Programming skills tests
HumanEval
Pass@1 • Self-reported
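Pass@1 is the fraction of problems solved by a single sample. When n samples are drawn per problem, the standard unbiased pass@k estimator (popularized by the HumanEval evaluation setup) generalizes this; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples and 10 correct, pass@1 reduces to c/n:
print(pass_at_k(20, 10, 1))  # → 0.5
```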
Mathematics
Mathematical problems and computations
GSM8k
Accuracy • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Diamond, Avg@8 • Self-reported
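Avg@k (used for GPQA Diamond, AIME, HMMT, and Tau2 below) is simply the mean score over k independent sampled attempts, smoothing out sampling variance. A trivial sketch:

```python
from statistics import mean

def avg_at_k(scores: list[float]) -> float:
    """Avg@k: mean score over k independent sampled attempts."""
    return mean(scores)

# e.g. Avg@8: eight attempts on the same benchmark, scored 0/1 each:
print(avg_at_k([1, 1, 0, 1, 1, 0, 1, 1]))  # → 0.75
```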
Other Tests
Specialized benchmarks
AceBench
Accuracy • Self-reported
Aider-Polyglot
Accuracy • Self-reported
AIME 2024
Avg@64 • Self-reported
AIME 2025
Avg@64 • Self-reported
AutoLogi
Accuracy • Self-reported
CBNSL
Accuracy • Self-reported
CNMO 2024
Avg@16 • Self-reported
CSimpleQA
Correct • Self-reported
HMMT 2025
Avg@32 • Self-reported
HumanEval-ER
Pass@1 • Self-reported
Humanity's Last Exam
Accuracy • Self-reported
IFEval
Standard evaluation • Self-reported
LiveBench
Pass@1 (n=100) • Self-reported
LiveCodeBench v6
Pass@1 • Self-reported
MATH-500
Accuracy • Self-reported
MMLU-Pro
EM • Self-reported
MMLU-Redux
EM • Self-reported
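The "EM" metric on MMLU-Pro and MMLU-Redux is exact match after light answer normalization. A minimal sketch of such a scorer; the exact normalization rules (casing, punctuation, whitespace) vary between evaluation harnesses:

```python
import string

def exact_match(pred: str, gold: str) -> bool:
    """Exact match (EM) after lowercasing, stripping punctuation, and
    collapsing whitespace -- one common normalization; harnesses differ."""
    def norm(s: str) -> str:
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())
    return norm(pred) == norm(gold)

print(exact_match(" (B) ", "b"))  # → True under this normalization
```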
MultiChallenge
Accuracy • Self-reported
MultiPL-E
Pass@1 • Self-reported
MuSR
Pass@1 • Self-reported
OJBench
Pass@1 • Self-reported
PolyMath-en
Avg@4 • Self-reported
SimpleQA
Standard evaluation • Self-reported
SuperGPQA
Accuracy • Self-reported
SWE-bench Multilingual
Standard evaluation • Self-reported
SWE-bench Verified (Agentic Coding)
Standard evaluation • Self-reported
SWE-bench Verified (Agentless)
Standard evaluation • Self-reported
SWE-bench Verified (Multiple Attempts)
Standard evaluation • Self-reported
Tau2 airline
Avg@4 • Self-reported
Tau2 retail
Avg@4 • Self-reported
Tau2 telecom
Avg@4 • Self-reported
Terminal-bench
Standard evaluation • Self-reported
Terminus
Accuracy • Self-reported
ZebraLogic
Accuracy • Self-reported
License & Metadata
License
Modified MIT License
Announcement Date
July 11, 2025
Last Updated
July 19, 2025
Similar Models
Kimi K2 0905
Moonshot AI
1.0T
Best score: 0.9 (HumanEval)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2-Instruct-0905
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2-Thinking-0905
Moonshot AI
1.0T
Best score: 0.8 (GPQA)
Released: Sep 2025
Price: $0.60/1M tokens
Kimi K2 Base
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Jul 2025
MiMo-V2-Flash
Xiaomi
309.0B
Best score: 0.8 (GPQA)
Released: Dec 2025
Command R+
Cohere
104.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Price: $0.25/1M tokens
GLM-4.7
Zhipu AI
358.0B
Best score: 0.9 (TAU)
Released: Dec 2025
Price: $0.60/1M tokens
LongCat-Flash-Chat
Meituan
560.0B
Best score: 0.9 (MMLU)
Released: Aug 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.