
Qwen2.5 14B Instruct

Alibaba

Qwen2.5-14B-Instruct is a 14.7 billion parameter language model fine-tuned for instruction following, part of the Qwen2.5 series. It demonstrates significant improvements in instruction following, long text generation (8K+ tokens), structured data understanding, and JSON output generation. The model supports a 128K token context window and multilingual capabilities for over 29 languages, including Chinese, English, French, Spanish, and others.
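As an illustrative sketch (not official usage documentation), the model can be run through the Hugging Face transformers chat interface. The hub id Qwen/Qwen2.5-14B-Instruct is the published repository name; dtype, device placement, and generation settings below are placeholders to adjust for your hardware.

```python
# Minimal sketch: chat with Qwen2.5-14B-Instruct via Hugging Face transformers.
# Assumes the standard hub id "Qwen/Qwen2.5-14B-Instruct"; adjust dtype/device as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Qwen2.5 series in one sentence."},
]
# apply_chat_template inserts the model's expected role tags before tokenizing.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```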

Key Specifications

Parameters
14.7B
Context
128K tokens
Release Date
September 19, 2024
Average Score
70.0%

Timeline

Key dates in the model's history
Announcement
September 19, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
14.7B
Training Tokens
18.0T tokens
Knowledge Cutoff
-
Family
Qwen2.5
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
The model answers multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, spanning the humanities, STEM, social sciences, and other domains. Method: 100 questions are sampled from the MMLU test set and posed zero-shot in the format "{Question} A. {Option A} B. {Option B} C. {Option C} D. {Option D} Answer:". The answer letter (A, B, C, or D) is extracted from the completion and compared with the ground-truth answer.
Self-reported
79.7%
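A minimal sketch of the zero-shot protocol described above, assuming a hypothetical ask_model callable that returns the model's raw completion and items carrying the question, its four options, and the gold letter:

```python
# Sketch of the zero-shot MMLU protocol: format the prompt, extract the
# first committed answer letter, compare with the gold label.
import re

PROMPT = "{q}\nA. {a}\nB. {b}\nC. {c}\nD. {d}\nAnswer:"

def score_mmlu(questions, ask_model):
    correct = 0
    for item in questions:  # item: {"question": str, "options": [str]*4, "answer": "A".."D"}
        prompt = PROMPT.format(q=item["question"], a=item["options"][0],
                               b=item["options"][1], c=item["options"][2],
                               d=item["options"][3])
        completion = ask_model(prompt)
        match = re.search(r"\b([ABCD])\b", completion)  # first standalone letter
        if match and match.group(1) == item["answer"]:
            correct += 1
    return correct / len(questions)
```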
TruthfulQA
TruthfulQA evaluates a model's answers to questions that people often answer incorrectly because of common misconceptions. The MC1 metric measures how often the model selects the single correct answer from the provided options, with values from 0 to 1, where 1 is a perfect score. The MC2 metric measures how much probability mass the model assigns to the set of true answers relative to the false ones, so it rewards models that consistently weight correct answers over incorrect ones.
Self-reported
58.4%
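A sketch of how MC1 and MC2 are commonly computed from per-option log-likelihoods; loglikelihood is a hypothetical scorer returning the model's total log-probability of the option text given the question:

```python
# MC1: does the single highest-likelihood option match the correct one?
# MC2: normalized probability mass assigned to the true answers.
import math

def mc1(question, options, correct_idx, loglikelihood):
    scores = [loglikelihood(question, o) for o in options]
    return 1.0 if scores.index(max(scores)) == correct_idx else 0.0

def mc2(question, true_opts, false_opts, loglikelihood):
    probs_true = [math.exp(loglikelihood(question, o)) for o in true_opts]
    probs_false = [math.exp(loglikelihood(question, o)) for o in false_opts]
    total = sum(probs_true) + sum(probs_false)
    return sum(probs_true) / total
```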

Programming

Programming skills tests
HumanEval
HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model must generate a function body that passes all of the problem's tests; performance is reported as pass@1, the fraction of problems solved with a single sampled completion.
Self-reported
83.5%
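For reference, the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), of which pass@1 is the k=1 case; this is a standard formula, not specific to this model card:

```python
# Unbiased pass@k: n samples per problem, c of which pass the tests.
# pass@k = 1 - C(n-c, k) / C(n, k); pass@1 reduces to c / n.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains a passing sample
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12  # pass@1 == c / n
```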
MBPP
MBPP (Mostly Basic Python Programming) is a set of roughly 1,000 programming tasks that require the model to write Python functions solving the stated problems. Unlike HumanEval, MBPP includes the test cases directly in the task descriptions. Evaluation uses a 500-task subset: for each task the model must generate a Python function that performs the task and passes its tests, and the fraction of correctly solved tasks gives the pass@1 score.
Self-reported
82.0%
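A hedged sketch of the checking step: run the generated function against the assert statements bundled with each task. Real harnesses execute this in a sandboxed subprocess with a timeout; bare exec() here is for illustration only.

```python
# Illustrative MBPP-style check: define the candidate function, then run
# the task's asserts (e.g. "assert add(2, 3) == 5") in the same namespace.
def passes_tests(generated_code: str, test_asserts: list[str]) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        for test in test_asserts:
            exec(test, namespace)        # raises AssertionError on failure
        return True
    except Exception:
        return False
```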

Mathematics

Mathematical problems and computations
GSM8k
GSM8K is a benchmark of grade-school math word problems that require multi-step arithmetic reasoning. For each problem the model works step by step to a final numeric answer, which is extracted from the completion and compared with the reference; the score is the percentage of problems answered correctly.
Self-reported
94.8%
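GSM8K reference answers end with "#### <number>", and a common scoring approach compares that against the last number in the model's completion; a sketch under that assumption:

```python
# Common GSM8K scoring sketch: last number in the completion vs. the
# reference answer, which follows "#### " in the dataset.
import re

def extract_last_number(text: str) -> str | None:
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

def is_correct(completion: str, reference: str) -> bool:
    gold = reference.split("####")[-1].strip().replace(",", "")
    return extract_last_number(completion) == gold
```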
MATH
Models are evaluated on the MATH benchmark (Hendrycks et al., 2021), which consists of 5,000 competition mathematics problems drawn from various contests and resources. Each problem comes with a final answer and a worked solution, and difficulty ranges from level 1 (easiest) to level 5 (hardest). Following prior work, the model generates a step-by-step solution for each problem; the final answer is extracted and counted correct only if it exactly matches the reference. Besides overall accuracy, results can also be broken down by difficulty level and problem type.
Self-reported
80.0%
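Final answers in MATH solutions are conventionally wrapped in \boxed{...}; a sketch of extracting the boxed content with balanced braces and exact-matching it (production harnesses add LaTeX normalization on top of this):

```python
# Pull the contents of the last \boxed{...} in the solution, honoring
# nested braces such as \boxed{\frac{1}{2}}, then exact-match the gold answer.
def extract_boxed(solution: str) -> str | None:
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    depth, i = 0, start + len(r"\boxed{")
    begin = i
    while i < len(solution):
        if solution[i] == "{":
            depth += 1
        elif solution[i] == "}":
            if depth == 0:
                return solution[begin:i]
            depth -= 1
        i += 1
    return None  # unbalanced braces

def is_correct(solution: str, gold: str) -> bool:
    return extract_boxed(solution) == gold
```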

Reasoning

Logical reasoning and analysis
GPQA
The model is evaluated on GPQA, a benchmark of graduate-level, "Google-proof" multiple-choice questions written by domain experts in fields such as biology, physics, and chemistry. The questions are designed to be challenging even for frontier models and typically require deep domain expertise to answer correctly. For each question the model is instructed to reason step by step before selecting a final answer, which is extracted from the completion and compared to the ground truth.
Self-reported
45.5%

Other Tests

Specialized benchmarks
ARC-C
ARC-C (AI2 Reasoning Challenge, Challenge set) contains 1,172 multiple-choice science questions written for grade 3-9 exams and selected specifically because they are hard for AI models to answer from surface text alone. ARC-C tests both scientific knowledge and reasoning ability, including questions that require chaining logical steps, applying scientific concepts to new situations, and comparing several alternatives, so it rewards genuine reasoning rather than simple pattern matching.
Self-reported
67.3%
BBH
BBH (BIG-Bench Hard) is a suite of tests curated by Google for evaluating an AI system's ability to solve hard problems, including mathematical tasks, reasoning about situations, applying scientific knowledge, and using common sense. The model is evaluated on 23 distinct tasks, and performance is the fraction of instances it solves correctly. BBH is often used to assess reasoning and general problem-solving ability, since it bundles diverse, difficult tasks that demand different types of thinking.
Self-reported
78.2%
HumanEval+
HumanEval+ is a stricter variant of the HumanEval benchmark that adds many extra test cases to each task, verifying that generated code is correct on edge cases as well as the basic scenarios. A solution counts only if it passes the expanded test suite, so HumanEval+ scores are typically lower than HumanEval scores for the same model.
Self-reported
51.2%
MBPP+
MBPP+ is an extended version of the MBPP (Mostly Basic Python Programming) benchmark for testing a model's ability to generate Python code. MBPP contains 974 tasks, each consisting of a natural-language task description, a reference solution, and three test examples. During evaluation the model receives the task description and examples, generates a Python function, and the code is automatically checked against the tests; the final score is the percentage of correctly solved tasks. MBPP+ uses stricter test suites than MBPP, so solutions must handle the examples precisely and without errors.
Self-reported
63.2%
MMLU-Pro
MMLU-Pro is an extension of the popular MMLU (Massive Multitask Language Understanding) benchmark. It contains substantially more difficult questions than MMLU across subjects including law, STEM, the humanities, and the social sciences, making it a more demanding test of advanced reasoning capabilities.
Self-reported
63.7%
MMLU-Redux
MMLU-Redux is a re-annotated subset of the MMLU benchmark in which mislabeled, ambiguous, or otherwise flawed questions have been corrected, giving a cleaner measure of the same general-knowledge skills.
Self-reported
80.0%
MMLU-STEM
MMLU-STEM evaluates the model on the STEM subset of the MMLU benchmark: abstract algebra, anatomy, astronomy, college biology, college chemistry, college computer science, college mathematics, college physics, college medicine, computer security, conceptual physics, electrical engineering, elementary mathematics, formal logic, high school biology, high school chemistry, high school computer science, high school mathematics, high school physics, high school statistics, machine learning, physics, and virology. Questions are multiple-choice with options A-D, posed in a standard 5-shot setup where five solved examples from the same subject precede each test question. The score is the percentage of questions answered correctly, evaluated without access to external tools or the internet.
Self-reported
76.4%
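A sketch of the 5-shot prompt assembly described above, assuming dev-set examples carry the same question/options/answer fields as the test items:

```python
# Assemble a 5-shot MMLU-STEM prompt: five solved dev-set examples from
# the same subject, then the test question with a bare "Answer:" cue.
def format_example(ex: dict, with_answer: bool = True) -> str:
    letters = "ABCD"
    lines = [ex["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(ex["options"])]
    lines.append(f"Answer: {ex['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples: list[dict], test_example: dict) -> str:
    shots = [format_example(ex) for ex in dev_examples[:5]]
    return "\n\n".join(shots + [format_example(test_example, with_answer=False)])
```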
MultiPL-E
MultiPL-E is a benchmark for evaluating code generation across multiple programming languages. It is derived from HumanEval and includes 164 hand-written programming problems, each with a function signature, docstring, body, and several unit tests; the task is to generate the full function body from the signature and docstring. The benchmark covers 18 programming languages: C++, C#, D, Go, Java, JavaScript, Julia, Kotlin, Lua, PHP, Perl, Python, R, Ruby, Rust, Scala, Swift, and TypeScript.
Self-reported
72.8%
TheoremQA
TheoremQA evaluates a model's ability to apply mathematical and scientific theorems to answer questions, covering theorems from mathematics, physics, electrical engineering and computer science, and finance.
Self-reported
43.0%

License & Metadata

License
Apache 2.0
Announcement Date
September 19, 2024
Last Updated
July 19, 2025
