Key Specifications
Parameters
104.0B
Context
128.0K
Release Date
August 30, 2024
Average Score
74.6%
Timeline
Key dates in the model's history
Announcement
August 30, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
104.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.25
Output (per 1M tokens)
$1.00
Max Input Tokens
128.0K
Max Output Tokens
128.0K
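The listed rates make per-request costs straightforward to estimate. A minimal sketch using the rates from the table above; the function name and example token counts are illustrative, not from any stated API:

```python
# Cost estimate from the listed rates; rates are taken from the pricing
# table above, token counts in the example are illustrative.
INPUT_RATE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_RATE_PER_M = 1.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token rates."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: a full 100K-token context with a 2K-token completion.
print(f"${estimate_cost(100_000, 2_000):.4f}")  # -> $0.0270
```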
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
AI evaluation: assessing LLM output quality using knowledge tests such as GPQA and MMLU and mathematics tests (SAT, AIME, AMC, etc.). Benchmark-centric: model performance on standard benchmarks such as GPQA, MMLU, GSM8K, and MATH, compared across model sizes and prompts. Approaches:
- Comparison with other models on standard benchmarks
- Comparison across versions of the same model (e.g., GPT-4 vs. GPT-4o)
- Analysis of results and verification of their reliability
- Evaluation with tools vs. without tools
- Evaluation by task complexity
- Testing different strategies for obtaining correct answers (chain-of-thought, decomposition into subtasks, etc.)
A minimal scoring sketch follows this entry. • Self-reported
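As a rough illustration of the benchmark-centric comparison described above, here is a minimal multiple-choice scoring loop. `ask_model`, `make_prompt`, and the item fields are hypothetical stand-ins, not part of any harness named in this entry:

```python
# Minimal multiple-choice scoring loop for benchmark comparisons.
# `ask_model`, `make_prompt`, and the item fields are hypothetical stand-ins.
import re
from typing import Callable

def accuracy(items: list[dict],
             ask_model: Callable[[str], str],
             make_prompt: Callable[[dict], str]) -> float:
    """Fraction of items whose extracted answer letter matches the key."""
    correct = 0
    for item in items:
        reply = ask_model(make_prompt(item))
        letters = re.findall(r"\b([A-D])\b", reply.upper())
        # Take the last letter mentioned, so chain-of-thought replies
        # that end with a final answer are scored the same way.
        correct += bool(letters) and letters[-1] == item["answer"]
    return correct / len(items)
```

Running the same items through different `make_prompt` strategies (direct vs. chain-of-thought) yields the strategy comparison the entry mentions.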
MMLU
AI evaluation: find the first singular value using the SVD, identifying the relationship between A^T A and A. Human: In the singular value decomposition (SVD), A is written as A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix with the singular values on its diagonal. The first (largest) singular value σ₁ can be found from the eigenvalues of A^T A. From the SVD A = UΣV^T we get: A^T A = (UΣV^T)^T (UΣV^T) = V Σ^T U^T U Σ V^T = V Σ² V^T, since U is orthogonal and U^T U = I. Therefore the eigenvalues of A^T A are the squared singular values of A, and the first singular value is σ₁ = √λ₁, where λ₁ is the largest eigenvalue of A^T A. AI: You correctly identified the relationship between A and A^T A, and your explanation of why the first singular value follows from the eigenvalues of A^T A is correct. You also correctly showed that A^T A = V Σ² V^T using U^T U = I, which demonstrates that the eigenvalues of A^T A are indeed the squared singular values of A. Your approach and the result σ₁ = √λ₁, where λ₁ is the largest eigenvalue of A^T A, are both correct. • Self-reported
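The σ₁ = √λ₁ relationship in this exchange is easy to verify numerically. A minimal NumPy check, illustrative only and not part of the benchmark itself:

```python
# Numerical check of the derivation above: the largest singular value of A
# equals the square root of the largest eigenvalue of A^T A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

sigma1 = np.linalg.svd(A, compute_uv=False)[0]  # singular values, descending
lambda1 = np.linalg.eigvalsh(A.T @ A)[-1]       # eigenvalues, ascending

assert np.isclose(sigma1, np.sqrt(lambda1))     # sigma_1 == sqrt(lambda_1)
```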
TruthfulQA
AI evaluation: people have a general reasoning ability: they know when they understand something and when they do not, they know when to apply multi-step reasoning, and they can use knowledge from first principles. Language models will apply these skills to new tasks only if they can genuinely reason. We evaluate general reasoning by testing models on benchmarks covering diverse reasoning abilities in varied contexts. For each test we use zero-shot queries, to ensure consistency in our comparisons between models, and we choose tasks where tools are not required. We evaluate the model on: 1. GPQA, for assessing knowledge of modern science; this benchmark contains graduate-level questions in physics, chemistry, biology, etc., and we use the multiple-choice format. 2. MATH, mathematics problems ranging from introductory to competition level, drawn from sources including AMC and AIME; we use the multiple-choice format. 3. Reasoning: a subset of BigBench Hard, which includes logical reasoning, context understanding, and similar tasks. For each task set we test with multiple-choice answers, using a standard query that simply contains the task and the options, without instructions or examples (a sketch of such a query format follows this entry). • Self-reported
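A sketch of the zero-shot multiple-choice query layout this description implies; the exact wording used in the evaluation is not given, so this format is an assumption:

```python
# Hypothetical zero-shot query layout implied by the description above:
# the prompt contains only the task and its options, no instructions or examples.
def zero_shot_prompt(question: str, options: list[str]) -> str:
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

print(zero_shot_prompt("What is 2 + 2?", ["3", "4", "5", "6"]))
# What is 2 + 2?
# A. 3
# B. 4
# ...
```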
Winogrande
AI evaluation: [fragment] the text was repeated several times; I can only provide a translation of the text. • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
AI evaluation: I was trained over almost a decade, including roughly three years on my current architecture, and was designed to excel at fundamental problem-solving tasks involving inference, reasoning, pattern recognition, and logical deduction. Throughout my development I underwent extensive benchmarking against a variety of standardized evaluation metrics designed to assess my performance across these dimensions. These metrics were selected to give a comprehensive picture of my capabilities and limitations in handling complex reasoning problems. My assessment framework includes both traditional evaluation metrics and more specialized measures tailored to specific aspects of reasoning. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
Evaluation: We evaluate models on a set of standard benchmarks measuring LLM abilities across subject fields. We used both established benchmarks and new, higher-difficulty ones requiring understanding of mathematics and other domains. We also tested models on specialized machine-learning tasks to verify their ability to reason and solve problems in contexts that are challenging even for human experts. These results show how our model performs on such tasks. In addition, each model went through further tests to ensure not only high scores but also high quality of responses. • Self-reported
License & Metadata
License
cc_by_nc
Announcement Date
August 30, 2024
Last Updated
July 19, 2025
Similar Models
Kimi K2 Base
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Jan 2025
Kimi K2 Instruct
Moonshot AI
1.0T
Best score: 0.9 (HumanEval)
Released: Jan 2025
Price: $0.57/1M tokens
Jamba 1.5 Large
AI21 Labs
398.0B
Best score: 0.9 (ARC)
Released: Aug 2024
Price: $2.00/1M tokens
Kimi K2-Instruct-0905
Moonshot AI
1.0T
Best score: 0.9 (MMLU)
Released: Sep 2025
Price: $0.60/1M tokens
GLM-4.5-Air
Zhipu AI
106.0B
Best score: 0.8 (TAU)
Released: Jul 2025
GLM-4.5
Zhipu AI
355.0B
Best score: 0.8 (GPQA)
Released: Jul 2025
Price: $0.60/1M tokens
MiniMax M2
MiniMax
230.0B
Best score: 0.8 (GPQA)
Released: Oct 2025
Price: $1.00/1M tokens
Llama 3.1 Nemotron Ultra 253B v1
NVIDIA
253.0B
Best score: 0.8 (GPQA)
Released: Apr 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.