Gemma 2 27B
Gemma 2 27B IT is a version of Google's state-of-the-art open language model fine-tuned for instruction following. Built on the same research and technology as Gemini, it is optimized for conversational applications through supervised fine-tuning, distillation from larger models, and RLHF. The model excels at text generation tasks including question answering, summarization, and reasoning.
Key Specifications
Parameters
27.2B
Context
-
Release Date
June 27, 2024
Average Score
69.1%
Timeline
Key dates in the model's history
Announcement
June 27, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
27.2B
Training Tokens
13.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
10-shot: the prompt includes ten solved examples before the actual question, so the model can see how answers to similar tasks should be structured. For a mathematical benchmark, for instance, the prompt would contain ten worked problems with their solutions ahead of the target problem. This technique is especially useful when we want the model to follow a specific answer format, when the task benefits from demonstrated reasoning, or when the task has a structure that is easier to show than to describe. The trade-off is that the examples consume a substantial number of tokens that are then unavailable for the task and the answer. • Self-reported
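To make the 10-shot setup concrete, here is a minimal sketch of how such a prompt can be assembled before querying a model. The example records and the `build_few_shot_prompt` helper are illustrative assumptions, not part of any published evaluation harness.

```python
# Illustrative n-shot prompt construction; `examples` would come from the
# benchmark's training split and is hypothetical here.
def build_few_shot_prompt(examples, query, n_shots=10):
    """Prepend n solved examples to the query so the model sees the
    expected input/output format before answering."""
    parts = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples[:n_shots]
    ]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    demo = [
        {"question": "2 + 2 = ?", "answer": "4"},
        {"question": "3 * 3 = ?", "answer": "9"},
    ]
    print(build_few_shot_prompt(demo, "5 + 7 = ?", n_shots=2))
```

The same builder covers the other k-shot settings on this page by changing `n_shots`.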
MMLU
5-shot, top-1: the prompt contains five worked examples, and the model's single highest-ranked answer is scored against the gold label; no multiple attempts or intermediate reasoning are counted. • Self-reported
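As an illustration of top-1 scoring on a multiple-choice benchmark such as MMLU, the sketch below picks the answer option the model scores highest and counts it correct only if it matches the gold label. The `score_option` callable is a placeholder for the model's per-option likelihood; this is a hypothetical sketch, not Google's evaluation code.

```python
# Hypothetical top-1 scoring for multiple-choice items; `score_option`
# stands in for the model's log-likelihood of an option given the question.
def top1_correct(question, options, label, score_option):
    scores = {opt: score_option(question, opt) for opt in options}
    prediction = max(scores, key=scores.get)  # single best-ranked option
    return prediction == label

def accuracy(items, score_option):
    """items: iterable of (question, options, gold_label) tuples."""
    results = [top1_correct(q, opts, gold, score_option) for q, opts, gold in items]
    return sum(results) / len(results)
```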
Winogrande
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
pass@1: the probability that the model solves a problem on a single attempt, which is the most realistic measure for real-world use where users expect a correct answer on the first try. It is computed by giving the model one attempt at each problem in the benchmark and measuring the fraction solved correctly. Unlike metrics that allow multiple attempts (e.g. pass@k, which checks whether any of k samples is correct), pass@1 does not let the model generate several outputs and keep the best one. • Self-reported
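For reference, pass@1 reduces to the fraction of problems solved on a single attempt, and the standard unbiased pass@k estimator generalizes it when n samples are drawn per problem. A minimal sketch, with the code generation and unit-test execution left out:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generated samples of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 is simply c / n: the fraction of single attempts that pass.
assert abs(pass_at_k(n=10, c=3, k=1) - 0.3) < 1e-9
```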
MBPP
3-shot • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
5-shot, maj@1: the prompt contains five worked examples, and a single sampled answer is scored per question (maj@1, i.e. majority voting over one sample). With larger sampling budgets, maj@k instead takes the most frequent of k sampled answers, which typically improves reliability on multiple-choice and short-answer tasks. • Self-reported
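The maj@k idea can be sketched as follows: sample k answers and keep the most frequent one; with k = 1 it collapses to scoring a single sampled answer, which is how maj@1 is read here. The `sample_answer` callable is a hypothetical stand-in for the model being evaluated.

```python
from collections import Counter

def majority_at_k(sample_answer, prompt, k=1):
    """Sample k answers and return the most common one (maj@k).
    With k=1 this is just the single sampled answer (maj@1)."""
    answers = [sample_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a dummy sampler, purely for illustration:
print(majority_at_k(lambda p: "42", "What is 6 * 7?", k=1))  # -> 42
```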
MATH
4-shot: the model sees four examples of a problem type, each with its solution, before being asked to solve a similar problem. The setup tests whether the model can recognize the pattern in the examples, extract a general problem-solving approach, and apply it to a new instance, i.e. its in-context learning ability, in contrast to zero-shot evaluation with no examples. It is commonly used for mathematical reasoning, coding challenges, logical puzzles, and rule-based tasks; the examples are chosen to cover the key aspects of the domain, demonstrate the correct solution process, and vary enough that the model must abstract rather than memorize. This provides a standardized way to compare the problem-solving abilities of different models. • Self-reported
Other Tests
Specialized benchmarks
AGIEval
3-5-shot: the number of in-context examples varies between three and five depending on the sub-benchmark. The examples are tasks of the same type, difficulty, and answer format (yes/no, multiple choice, etc.) as the question being evaluated, but never the question itself. This probes few-shot learning, which matters for applications where a model must adapt to a new task from only a handful of examples. • Self-reported
ARC-C
25-shot • Self-reported
ARC-E
0-shot: the model is evaluated purely on its answers to the tasks, with no examples or additional context in the prompt. Strength: this comes closest to how the model performs in real conditions and measures its raw ability, since it receives no help in the form of demonstrations. Weakness: the model may misinterpret the task or fail to present its answer in the expected format. • Self-reported
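In code terms, 0-shot is the degenerate case of the few-shot prompt construction sketched earlier: nothing is prepended except the task itself. A minimal, hypothetical illustration:

```python
def build_zero_shot_prompt(instruction, question):
    """0-shot: the model sees only the task instruction and the question,
    with no solved examples to copy format or reasoning from."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

print(build_zero_shot_prompt("Answer the science question.", "Why is the sky blue?"))
```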
BIG-Bench
Method "3-shot, CoT" (3-with chain reasoning) based on concepts output model several examples, detailed steps solutions — so chain reasoning (Chain-of-Thought). Such approach allows model and structure logical reasoning, for solutions complex tasks. In this method are used three (3) thoroughly example, process step-by-step solutions tasks. Each example reasoning, course thoughts for achievements answer. This helps model on mode thinking and apply approach to new task. Method especially efficient for mathematical, logical and tasks, where critically important reasoning. approach in that, that he not requires additional training model, and works exclusively on level prompt • Self-reported
BoolQ
Evaluating large language models is a complex task: existing benchmarks measure narrow aspects of performance, often fail to capture a model's overall capabilities, and rarely show how differences between models play out in real usage. To get a fuller picture, the model is assessed across diverse usage scenarios and queries rather than on a single score: performance on varied tasks, analysis of its reasoning and outputs, comparisons against other models, and how it handles adversarial (jailbreak) prompts. This not only gives a more complete view of the model's capabilities but also helps identify areas for improvement. • Self-reported
Natural Questions
5-shot • Self-reported
PIQA
0-shot: the model receives the question or task with no additional examples or instructions and must rely solely on its own knowledge and training to produce an answer. This evaluates how well the model can use its base knowledge without extra context. Advantages: simple to apply, close to real usage scenarios, and measures base knowledge without prompting support. Limitations: it does not account for the model's ability to learn from in-context examples, it can understate performance on complex tasks, and it gives the model no way to fit its answer to a specific format. • Self-reported
Social IQa
Single attempt • Self-reported
TriviaQA
5-shot • Self-reported
License & Metadata
License
gemma
Announcement Date
June 27, 2024
Last Updated
July 19, 2025
Similar Models
Gemma 2 9B
9.2B
Best score: 0.7 (MMLU)
Released: Jun 2024
Gemini Diffusion
Best score: 0.9 (HumanEval)
Released: May 2025
Gemma 3 27B
27.0B (multimodal)
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.11/1M tokens
Gemma 3 12B
12.0B (multimodal)
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.05/1M tokens
Llama-3.3 Nemotron Super 49B v1
NVIDIA
49.9B
Best score: 0.7 (GPQA)
Released: Mar 2025
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
Mistral NeMo Instruct
Mistral AI
12.0B
Best score: 0.7 (MMLU)
Released: Jul 2024
Price: $0.15/1M tokens
Magistral Small 2506
Mistral AI
24.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.