
Gemma 2 27B

Google

Gemma 2 27B IT is a version of Google's state-of-the-art open language model fine-tuned for instruction following. Built on the same research and technology as Gemini, it is optimized for conversational applications through supervised fine-tuning, distillation from larger models, and RLHF. The model excels at text generation tasks including question answering, summarization, and reasoning.

Key Specifications

Parameters
27.2B
Context
-
Release Date
June 27, 2024
Average Score
69.1%

Timeline

Key dates in the model's history
Announcement
June 27, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
27.2B
Training Tokens
13.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot. The prompt includes ten worked examples before the target question, showing the model the expected answer format and solution structure. This helps when a task has a consistent structure or benefits from demonstrated reasoning, at the cost of additional prompt tokens. Self-reported
86.4%
MMLU
5-shot, top-1. The prompt contains five worked examples, and the model's single most likely answer is scored; no multiple sampling or best-of-n selection is applied. Self-reported
75.2%
Winogrande
5-shot. Self-reported
83.7%
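The n-shot settings above all follow the same recipe: prepend k solved examples to the target question so the model can infer the answer format. A minimal sketch, with hypothetical example data rather than the actual evaluation harness:

```python
# Sketch of k-shot prompt construction (illustrative data, not the real
# benchmark harness): k worked examples are prepended to the target question.

def build_few_shot_prompt(examples, question, k):
    """Prepend k solved (question, answer) pairs to the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    parts.append(f"Q: {question}\nA:")  # model completes the final answer
    return "\n\n".join(parts)

examples = [
    ("2 + 2 = ?", "4"),
    ("3 + 5 = ?", "8"),
    ("7 - 4 = ?", "3"),
]
prompt = build_few_shot_prompt(examples, "6 + 1 = ?", k=2)
```

With k = 0 this degenerates to the 0-shot setting used by several benchmarks below: only the bare question is sent.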

Programming

Programming skills tests
HumanEval
pass@1. The pass@1 metric is the probability that the model solves a problem in a single attempt: the model generates one solution per benchmark problem, and the fraction of problems solved correctly is measured. Unlike metrics that allow multiple attempts (e.g. pass@k, which counts a problem as solved if any of k samples is correct), pass@1 does not let the model generate several outputs and select the best, which better reflects real-world use where the user receives a single response. Self-reported
51.8%
MBPP
3-shot. Self-reported
62.6%
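The pass@1 scoring described above reduces to a fraction over single-attempt outcomes. A minimal sketch (illustrative, not the official HumanEval harness):

```python
# Sketch of pass@1 scoring: each problem gets exactly one generated solution,
# and pass@1 is the fraction of problems whose single attempt passes its tests.

def pass_at_1(outcomes):
    """outcomes: one boolean per problem (did the single attempt pass?)."""
    return sum(outcomes) / len(outcomes)

# e.g. 4 problems, the single attempt passed on 2 of them
score = pass_at_1([True, False, True, False])  # 0.5
```

pass@k for k > 1 requires sampling several solutions per problem and uses an unbiased combinatorial estimator rather than this simple average.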

Mathematics

Mathematical problems and computations
GSM8k
5-shot, maj@1. Five worked examples are included in the prompt. In maj@k evaluation, k final answers are sampled and the most frequent one is scored; with k = 1, as here, this reduces to scoring a single sampled answer with no aggregation. Self-reported
74.0%
MATH
4-shot. The model sees four examples of a problem type, each with a solution, before being asked to solve a similar problem. This tests in-context learning: recognizing patterns from the examples, extracting a general problem-solving approach, and applying it to a new instance rather than memorizing. Four examples are enough to establish a robust pattern while still testing generalization, which makes this setting useful for mathematical reasoning, coding, and logical puzzles. Self-reported
42.3%
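The maj@k aggregation mentioned for GSM8k can be sketched in a few lines (illustrative, not the actual evaluation code): sample k final answers, then score the most frequent one.

```python
from collections import Counter

# Sketch of maj@k answer aggregation: take the most common answer among the
# k sampled solutions. With k = 1 this reduces to scoring a single sample,
# which is what "maj@1" denotes.

def majority_answer(samples):
    """Return the most frequent final answer among sampled solutions."""
    return Counter(samples).most_common(1)[0][0]

voted = majority_answer(["18", "17", "18"])   # majority picks "18"
single = majority_answer(["42"])              # maj@1: the single sample
```

Majority voting over several samples (self-consistency) typically improves accuracy on math benchmarks, at the cost of k times the inference compute.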

Other Tests

Specialized benchmarks
AGIEval
3-5-shot. The prompt contains between three and five worked examples, depending on the sub-benchmark. The examples are tasks of the same type, format, and difficulty as the target question, with answers in the same form (yes/no, multiple choice, etc.), so the model can adapt to the task from a small number of demonstrations. This few-shot setting matters for applications where a model must adapt quickly to new tasks from limited examples. Self-reported
55.1%
ARC-C
25-shot. Self-reported
71.4%
ARC-E
0-shot. The model is evaluated purely on its answers, with no examples or additional context in the prompt. This probes the model's unaided ability, close to how it would perform in real conditions; the drawback is that the model may misinterpret the task or miss the specific format the answer should take. Self-reported
88.6%
BIG-Bench
3-shot, CoT. The prompt contains three worked examples, each showing detailed step-by-step chain-of-thought reasoning toward the answer. Seeing the reasoning process steers the model into the same explicit, structured mode of thinking on the new task. The method is especially effective for mathematical and logical problems where intermediate reasoning is critical, and it requires no additional training: it works purely at the prompt level. Self-reported
74.9%
BoolQ
Self-reported
84.8%
Natural Questions
5-shot. Self-reported
34.5%
PIQA
0-shot. The model receives the question or task with no additional examples or instructions, relying only on its training knowledge to produce an answer. This evaluates baseline knowledge without prompting aids and mirrors many real usage scenarios; however, it does not measure the model's ability to learn from examples, and it gives the model no way to infer a required answer format. Self-reported
83.2%
Social IQa
Self-reported
53.7%
TriviaQA
5-shot. Self-reported
83.7%
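The 3-shot chain-of-thought setting described for BIG-Bench differs from plain few-shot prompting only in that each demonstration spells out its reasoning before the final answer. A minimal sketch with hypothetical demonstrations:

```python
# Sketch of a 3-shot chain-of-thought prompt (hypothetical examples): each
# demonstration includes intermediate reasoning before the final answer,
# steering the model toward step-by-step solutions at the prompt level only.

COT_SHOTS = [
    ("A pen costs 2 and a book costs 5. Total for both?",
     "The pen is 2 and the book is 5, so 2 + 5 = 7. The answer is 7."),
    ("There are 12 eggs and 4 are used. How many remain?",
     "We start with 12 and remove 4, so 12 - 4 = 8. The answer is 8."),
    ("A train travels 60 km/h for 2 hours. Distance?",
     "Distance = speed x time = 60 x 2 = 120 km. The answer is 120."),
]

def build_cot_prompt(question):
    """Assemble three reasoning demonstrations plus the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in COT_SHOTS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt("A box holds 6 cans. How many cans in 4 boxes?")
```

A model completing this prompt tends to imitate the demonstrated "reason, then answer" pattern rather than emitting a bare final number.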

License & Metadata

License
gemma
Announcement Date
June 27, 2024
Last Updated
July 19, 2025
