Gemma 2 27B
Gemma 2 27B IT is a version of Google's state-of-the-art open language model fine-tuned for instruction following. Built on the same research and technology as Gemini, it is optimized for conversational applications through supervised fine-tuning, distillation from larger models, and RLHF. The model excels at text generation tasks including question answering, summarization, and reasoning.
Key Specifications
Parameters
27.2B
Context
-
Release Date
June 27, 2024
Average Score
69.1%
Timeline
Key dates in the model's history
Announcement
June 27, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
27.2B
Training Tokens
13.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot: the prompt includes ten solved examples before the actual question, so the model can see how answers to similar tasks should be structured. For a mathematical benchmark, for instance, the prompt would contain ten worked problems with their solutions ahead of the target problem. This technique is especially useful when we want the model to follow a specific answer format, when the task benefits from demonstrated reasoning, or when the task has a structure that is easier to show than to describe. The trade-off is that the examples consume a substantial number of tokens that are then unavailable for the task and the answer. • Self-reported
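To make the 10-shot setup concrete, here is a minimal sketch of how such a prompt can be assembled before querying a model. The example records and the `build_few_shot_prompt` helper are illustrative assumptions, not part of any published evaluation harness.

```python
# Illustrative n-shot prompt construction; `examples` would come from the
# benchmark's training split and is hypothetical here.
def build_few_shot_prompt(examples, query, n_shots=10):
    """Prepend n solved examples to the query so the model sees the
    expected input/output format before answering."""
    parts = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples[:n_shots]
    ]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    demo = [
        {"question": "2 + 2 = ?", "answer": "4"},
        {"question": "3 * 3 = ?", "answer": "9"},
    ]
    print(build_few_shot_prompt(demo, "5 + 7 = ?", n_shots=2))
```

The same builder covers the other k-shot settings on this page by changing `n_shots`.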
MMLU
5-shot, top-1: the prompt contains five worked examples, and the model's single highest-ranked answer is scored against the gold label; no multiple attempts or intermediate reasoning are counted. • Self-reported
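As an illustration of top-1 scoring on a multiple-choice benchmark such as MMLU, the sketch below picks the answer option the model scores highest and counts it correct only if it matches the gold label. The `score_option` callable is a placeholder for the model's per-option likelihood; this is a hypothetical sketch, not Google's evaluation code.

```python
# Hypothetical top-1 scoring for multiple-choice items; `score_option`
# stands in for the model's log-likelihood of an option given the question.
def top1_correct(question, options, label, score_option):
    scores = {opt: score_option(question, opt) for opt in options}
    prediction = max(scores, key=scores.get)  # single best-ranked option
    return prediction == label

def accuracy(items, score_option):
    """items: iterable of (question, options, gold_label) tuples."""
    results = [top1_correct(q, opts, gold, score_option) for q, opts, gold in items]
    return sum(results) / len(results)
```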
Winogrande
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
pass@1: the probability that the model solves a problem on a single attempt, which is the most realistic measure for real-world use where users expect a correct answer on the first try. It is computed by giving the model one attempt at each problem in the benchmark and measuring the fraction solved correctly. Unlike metrics that allow multiple attempts (e.g. pass@k, which checks whether any of k samples is correct), pass@1 does not let the model generate several outputs and keep the best one. • Self-reported
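For reference, pass@1 reduces to the fraction of problems solved on a single attempt, and the standard unbiased pass@k estimator generalizes it when n samples are drawn per problem. A minimal sketch, with the code generation and unit-test execution left out:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generated samples of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 is simply c / n: the fraction of single attempts that pass.
assert abs(pass_at_k(n=10, c=3, k=1) - 0.3) < 1e-9
```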
MBPP
3-shot • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
5-shot, maj@1: the prompt contains five worked examples, and a single sampled answer is scored per question (maj@1, i.e. majority voting over one sample). With larger sampling budgets, maj@k instead takes the most frequent of k sampled answers, which typically improves reliability on multiple-choice and short-answer tasks. • Self-reported
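The maj@k idea can be sketched as follows: sample k answers and keep the most frequent one; with k = 1 it collapses to scoring a single sampled answer, which is how maj@1 is read here. The `sample_answer` callable is a hypothetical stand-in for the model being evaluated.

```python
from collections import Counter

def majority_at_k(sample_answer, prompt, k=1):
    """Sample k answers and return the most common one (maj@k).
    With k=1 this is just the single sampled answer (maj@1)."""
    answers = [sample_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a dummy sampler, purely for illustration:
print(majority_at_k(lambda p: "42", "What is 6 * 7?", k=1))  # -> 42
```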
MATH
4-shot: the model sees four examples of a problem type, each with its solution, before being asked to solve a similar problem. The setup tests whether the model can recognize the pattern in the examples, extract a general problem-solving approach, and apply it to a new instance, i.e. its in-context learning ability, in contrast to zero-shot evaluation with no examples. It is commonly used for mathematical reasoning, coding challenges, logical puzzles, and rule-based tasks; the examples are chosen to cover the key aspects of the domain, demonstrate the correct solution process, and vary enough that the model must abstract rather than memorize. This provides a standardized way to compare the problem-solving abilities of different models. • Self-reported
Other Tests
Specialized benchmarks
AGIEval
3-5-shot: the number of in-context examples varies between three and five depending on the sub-benchmark. The examples are tasks of the same type, difficulty, and answer format (yes/no, multiple choice, etc.) as the question being evaluated, but never the question itself. This probes few-shot learning, which matters for applications where a model must adapt to a new task from only a handful of examples. • Self-reported
ARC-C
25-shot • Self-reported
ARC-E
0-shot: the model is evaluated purely on its answers to the tasks, with no examples or additional context in the prompt. Strength: this comes closest to how the model performs in real conditions and measures its raw ability, since it receives no help in the form of demonstrations. Weakness: the model may misinterpret the task or fail to present its answer in the expected format. • Self-reported
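In code terms, 0-shot is the degenerate case of the few-shot prompt construction sketched earlier: nothing is prepended except the task itself. A minimal, hypothetical illustration:

```python
def build_zero_shot_prompt(instruction, question):
    """0-shot: the model sees only the task instruction and the question,
    with no solved examples to copy format or reasoning from."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

print(build_zero_shot_prompt("Answer the science question.", "Why is the sky blue?"))
```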
BIG-Bench
Method "3-shot, CoT" (3-with chain reasoning) based on concepts output model several examples, detailed steps solutions — so chain reasoning (Chain-of-Thought). Such approach allows model and structure logical reasoning, for solutions complex tasks. In this method are used three (3) thoroughly example, process step-by-step solutions tasks. Each example reasoning, course thoughts for achievements answer. This helps model on mode thinking and apply approach to new task. Method especially efficient for mathematical, logical and tasks, where critically important reasoning. approach in that, that he not requires additional training model, and works exclusively on level prompt • Self-reported
BoolQ
Evaluating large language models is a complex task: existing benchmarks measure narrow aspects of performance, often fail to capture a model's overall capabilities, and rarely show how differences between models play out in real usage. To get a fuller picture, the model is assessed across diverse usage scenarios and queries rather than on a single score: performance on varied tasks, analysis of its reasoning and outputs, comparisons against other models, and how it handles adversarial (jailbreak) prompts. This not only gives a more complete view of the model's capabilities but also helps identify areas for improvement. • Self-reported
Natural Questions
5-shot • Self-reported
PIQA
0-shot: the model receives the question or task with no additional examples or instructions and must rely solely on its own knowledge and training to produce an answer. This evaluates how well the model can use its base knowledge without extra context. Advantages: simple to apply, close to real usage scenarios, and measures base knowledge without prompting support. Limitations: it does not account for the model's ability to learn from in-context examples, it can understate performance on complex tasks, and it gives the model no way to fit its answer to a specific format. • Self-reported
Social IQa
Single attempt • Self-reported
TriviaQA
5-shot • Self-reported
License & Metadata
License
gemma
Announcement Date
June 27, 2024
Last Updated
July 19, 2025
Similar Models
Gemma 2 9B
9.2B
Best score: 0.7 (MMLU)
Released: Jun 2024
Gemini Diffusion
Best score: 0.9 (HumanEval)
Released: May 2025
Gemma 3 27B
27.0B (multimodal)
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.11/1M tokens
Gemma 3 12B
12.0B (multimodal)
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.05/1M tokens
Llama-3.3 Nemotron Super 49B v1
NVIDIA
49.9B
Best score: 0.7 (GPQA)
Released: Mar 2025
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.