
Gemma 2 9B

Google

Gemma 2 9B IT is the version of Google's base Gemma 2 9B model fine-tuned for instruction following. The model was trained on 8 trillion tokens of web data, code, and mathematical content. It uses sliding-window attention, logit soft-capping, and knowledge distillation. It is optimized for conversational applications through supervised fine-tuning, distillation, RLHF, and model merging with WARP.

Key Specifications

Parameters
9.2B
Context
-
Release Date
June 27, 2024
Average Score
64.6%

Timeline

Key dates in the model's history
Announcement
June 27, 2024
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
9.2B
Training Tokens
8.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot evaluation · Self-reported
81.9%
MMLU
5-shot evaluation · Self-reported
The model is shown five example questions with their answers before the test question. The examples demonstrate the expected answer format and sometimes the reasoning, though they usually do not relate directly to the test item itself. Compared with 0-shot evaluation (instructions only, no examples), 5-shot prompting helps the model understand the task and answer format. It is especially useful because it mirrors real-world use, where people often provide a few examples before posing their actual question; the examples give the model context about the expected answers and can help it structure its reasoning correctly.
71.3%
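The 5-shot setup described above amounts to prepending worked examples to the prompt; a minimal sketch, with hypothetical demonstration items rather than actual MMLU questions:

```python
def build_few_shot_prompt(examples, question, instruction="Answer the question."):
    """Concatenate an instruction, worked Q/A examples, and the test
    question, leaving a trailing 'A:' for the model to complete."""
    parts = [instruction, ""]
    for q, a in examples:
        parts.append(f"Q: {q}")
        parts.append(f"A: {a}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)

# Five hypothetical demonstrations (5-shot):
demos = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "Shakespeare"),
    ("What gas do plants absorb?", "Carbon dioxide"),
    ("How many legs does a spider have?", "8"),
]
prompt = build_few_shot_prompt(demos, "What is the largest planet?")
```

The model then completes the final "A:", and the completion is scored against the reference answer.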
Winogrande
Self-reported
80.6%

Programming

Programming skills tests
HumanEval
pass@1 (first attempt) · Self-reported
40.2%
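The pass@1 score above can be computed with the standard unbiased pass@k estimator used for code benchmarks; a minimal sketch (the function name is my own, not from this page):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: given n generated samples of which
    c pass the unit tests, return the probability that at least one
    of k randomly drawn samples passes."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the passing fraction c / n:
p = pass_at_k(10, 4, 1)  # 4 of 10 samples pass -> 0.4
```

For pass@1 the estimator is simply the fraction of generated samples that pass the tests, which matches the "first attempt" reading in the method line.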
MBPP
3-shot evaluation · Self-reported
52.4%

Mathematics

Mathematical problems and computations
GSM8k
5-shot, majority@1 · Self-reported
The model receives five worked demonstrations (problem statements with solutions and answers) before the test problem, then generates an answer that is scored for accuracy. This measures few-shot learning: the ability to pick up a task from a handful of examples. Advantages: it mirrors real usage scenarios, where users often provide a few examples along with a query, and it shows how well the model extracts patterns from examples. Disadvantages: performance can depend on the quality of the chosen examples, it does not measure the model's zero-shot ability, and for some tasks five examples may be too few to identify the pattern.
68.6%
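With maj@1 the single sampled answer is scored directly; the same scorer generalizes to majority voting over k samples. A minimal sketch, with helper names of my own choosing:

```python
from collections import Counter

def majority_answer(samples):
    """Return the most common final answer among k sampled
    completions. With k = 1 (maj@1) this is just the single
    sample's answer."""
    return Counter(samples).most_common(1)[0][0]

def score(samples, gold):
    """Exact-match score of the majority answer against the reference."""
    return 1.0 if majority_answer(samples) == gold else 0.0

score(["18"], "18")            # maj@1: one greedy sample
score(["7", "9", "9"], "9")    # maj@3: vote over three samples
```

In practice the final numeric answer is first extracted from each completion before voting; that extraction step is omitted here.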
MATH
4-shot evaluation · Self-reported
36.6%

Other Tests

Specialized benchmarks
AGIEval
3-5-shot evaluation · Self-reported
The model is shown three to five worked examples (questions together with answers) of the task type being evaluated before the evaluation question. This helps the model understand the expected answer format and often yields better results than providing only instructions, or example questions without answers. It is especially useful for complex tasks where the answer format is not obvious or the task requires a specific kind of reasoning.
52.8%
ARC-C
25-shot evaluation · Self-reported
The prompt includes 25 example question-answer pairs before the test question, the standard setup for ARC-Challenge.
68.4%
ARC-E
0-shot evaluation · Self-reported
The model is given a prompt that asks it to perform the task but includes no examples of the expected output. 0-shot evaluation tests how well a model can understand and execute a task from natural-language instructions alone. For instance, asking a model to "Write a poem about a sunset" without showing it any poems first is a 0-shot evaluation. This measures how well the model generalizes from its training data and follows instructions without additional guidance, in contrast to few-shot evaluation, where examples are provided in the prompt.
88.0%
BIG-Bench
3-shot Chain-of-Thought · Self-reported
3-shot Chain-of-Thought (CoT) builds on few-shot prompting with step-by-step reasoning: instead of providing only input-answer pairs, each example also includes the sequence of intermediate reasoning steps that leads to the answer. The prompt contains three example tasks, each with the task, the reasoning, and the answer, followed by the task the model should solve using the same style of reasoning. Advantages: making the reasoning explicit improves performance on complex tasks, exposes errors in the model's process, and lets the model break a complex task into more manageable subtasks. Limitations: writing good CoT examples takes time and expertise; performance can depend on the chosen examples and how well they match the target task; and the examples can bias the model toward specific reasoning templates that do not fit every case.
68.2%
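The 3-shot CoT format described above can be sketched as prompt assembly; the worked examples here are hypothetical placeholders, not items from BIG-Bench:

```python
def build_cot_prompt(examples, task):
    """Each example carries (problem, reasoning, answer); the test
    task is appended with a trailing 'Reasoning:' so the model
    continues with its own step-by-step reasoning."""
    blocks = []
    for problem, reasoning, answer in examples:
        blocks.append(
            f"Problem: {problem}\nReasoning: {reasoning}\nAnswer: {answer}"
        )
    blocks.append(f"Problem: {task}\nReasoning:")
    return "\n\n".join(blocks)

# Three hypothetical demonstrations (3-shot):
demos = [
    ("2 + 3 * 4 = ?", "Multiply first: 3 * 4 = 12, then add 2.", "14"),
    ("If all cats are animals and Tom is a cat, is Tom an animal?",
     "Tom is a cat; every cat is an animal; so Tom is an animal.", "Yes"),
    ("A train travels 60 km in 1.5 hours. What is its speed?",
     "Speed = distance / time = 60 / 1.5.", "40 km/h"),
]
prompt = build_cot_prompt(demos, "5 + 6 * 2 = ?")
```

The model's completion is then parsed for the text after "Answer:" and compared with the reference.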
BoolQ
0-shot evaluation · Self-reported
In the 0-shot approach the model receives the assignment without any additional examples; it must rely solely on its existing knowledge and abilities. This is the simplest testing format and the one that most closely reflects real-world use, where users rarely provide a set of examples. 0-shot evaluation measures a model's baseline abilities: strong results indicate broad knowledge and a high capacity for generalization. Note, however, that the 0-shot approach can understate a model's capabilities, since performance often improves significantly with a few examples or more detailed instructions.
84.2%
Natural Questions
5-shot evaluation · Self-reported
29.2%
PIQA
0-shot evaluation · Self-reported
81.7%
Social IQa
0-shot evaluation · Self-reported
In 0-shot evaluation the model performs the task directly, with no opportunity to adapt to the task or test-set format from examples, and without being steered toward a particular answer style. This gives the most direct picture of how the model solves tasks out of the box and is often the hardest mode for the model.
53.4%
TriviaQA
5-shot evaluation · Self-reported
76.6%

License & Metadata

License
gemma
Announcement Date
June 27, 2024
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.