
Gemma 3n E4B Instructed LiteRT Preview

Multimodal
Google

Gemma 3n is a generative AI model optimized for everyday devices such as phones, laptops, and tablets. The model incorporates innovations like Per-Layer Embedding (PLE) parameter caching and the MatFormer model architecture to reduce computational and memory requirements. These models process audio, text, and visual data, though this E4B preview currently supports text and visual input. Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same research and technology used to create Gemini models, and licensed for responsible commercial use.

Key Specifications

Parameters
1.9B
Context
-
Release Date
May 20, 2025
Average Score
50.3%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
1.9B
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot accuracy. 0-shot and 1-shot setups often do not fully expose a language model's reasoning ability, so this evaluation measures how performance improves as the number of in-prompt examples grows: on reasoning-heavy tasks, accuracy improves when 10 examples are given instead of 0 or 1. Advantages of 10-shot evaluation: (1) the model is not penalized for being unfamiliar with the task format or question type; (2) it works for models reachable only through an API; (3) it measures reasoning performance more precisely. Method: the model is shown 10 question-answer pairs before answering the target question. Self-reported.
78.6%
MMLU
0-shot accuracy. Self-reported.
64.9%
Winogrande
5-shot accuracy. We compute the model's accuracy using 5-shot prompts, where the in-context examples are other question-answer pairs drawn from the same benchmark dataset. 5-shot evaluation gives a more reliable picture of the model than 0-shot, since it reduces errors caused by the model misinterpreting the instructions. Self-reported.
71.7%

Programming

Programming skills tests
HumanEval
0-shot pass@1. Several multiple-choice benchmarks, such as MMLU, give the model a question and several answer options to choose from; this benchmark instead counts the tasks the model solves correctly on its first attempt. The "pass on first attempt" (pass@1) score is especially useful because it mirrors how people actually use the model. In other cases, especially tasks that require deliberate thinking and reasoning, it may understate the model's capability: a person might produce a first solution, notice an error, and correct it, and pass@1 does not account for such corrections. Here we evaluate how often the model reaches the correct answer with no examples (0-shot) and on the first attempt, which gives a fuller picture of its reliability, especially where reasoning is required (see the sketch below). Self-reported.
75.0%
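As a minimal sketch of how this kind of score is computed, assuming the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) rather than this card's exact harness: with one sample per task, pass@1 reduces to the fraction of tasks whose first completion passes the unit tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per task, c of them correct.

    Probability that at least one of k samples (chosen from the n
    drawn) passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per task, pass@1 is just the pass rate:
first_attempts = [True, False, True, True]        # did each task's first completion pass?
print(sum(first_attempts) / len(first_attempts))  # 0.75
print(pass_at_k(n=1, c=1, k=1))                   # 1.0 for a task whose only sample passed
```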
MBPP
3-shot pass@1. The model is shown three example tasks with solutions (3-shot) and must then apply the same reasoning to new tasks. This tests in-context learning (can the model pick up patterns from a few examples?), generalization (can it apply the approach to a new task?), and consistency of reasoning. We measure pass@1, correctness on the first attempt, meaning the model gets exactly one chance per task; this strict setup does not let the model retry or repair errors. The score shows how well a model can pick up a new task "on the fly" from a small number of examples, which matters for applications where users can provide only a few samples when teaching the system a new task. Self-reported.
63.6%

Mathematics

Mathematical problems and computations
MGSM
0-shot accuracy. 0-shot accuracy is the percentage of tasks answered correctly without any in-context (training) examples. In this setting the model must solve the task using only the knowledge acquired during training plus the instructions in the prompt. 0-shot accuracy is one of the key measures for large language models, since it shows whether they can perform new tasks without task-specific training: the model must generate the correct answer on the first attempt, without any hints or examples. Self-reported.
60.7%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Few-shot accuracy. Few-shot evaluation measures a model's ability to solve tasks given a small number of worked examples. Unlike zero-shot testing, where the model must perform the task with no prior examples, few-shot testing supplies several example tasks with their solutions before asking the model to solve a new one. This is especially important for assessing in-context learning and adaptation, and it mirrors real deployments where users demonstrate the desired behavior a few times before the model performs the task on its own. The evaluation: (1) show the model k example tasks with solutions (k is typically 1 to 5); (2) pose a new task of the same type; (3) score the model's answer (see the prompt-assembly sketch below). Scoring varies by task type, from exact match for single-answer tasks to softer scores for tasks with several valid answers. Self-reported.
52.9%
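A minimal sketch of the k-shot prompt assembly described above; the `Q:`/`A:` formatting and the helper name are illustrative assumptions, not the exact harness behind this score.

```python
def build_few_shot_prompt(examples, target_question, k=3):
    """Assemble a k-shot prompt: k worked examples, then the new task.

    `examples` is a list of (question, answer) pairs from the same
    benchmark; the model is asked to complete the final answer."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)

demos = [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9"), ("10 - 7 = ?", "3")]
print(build_few_shot_prompt(demos, "6 / 2 = ?", k=3))
```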
DROP
1-shot token F1. Token-level F1 measures the model's answer accuracy at the token level in a 1-shot setting: the model's answer is compared with the reference answer token by token, after the model has seen one example demonstrating the correct answer format. Token-level F1 is especially useful where exact matches matter, such as extracting factual information or precise spans, because it balances recall (how many reference tokens the model produced) against precision (how many produced tokens are correct). The 1-shot setup probes the model's ability to learn the format from a single example, which matters in real use where little demonstration data is available (a minimal F1 sketch follows below). Self-reported.
60.8%
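A minimal sketch of token-level F1, assuming SQuAD/DROP-style whitespace tokenization and lowercasing; the official DROP evaluator adds further normalization (punctuation and article stripping, number handling).

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of precision and recall over the multiset of
    lowercased, whitespace-split answer tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("born in 1809 in Kentucky", "1809"))  # partial credit: ~0.33
```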
GPQA
Diamond subset, 0-shot RelaxedAccuracy. The score on GPQA Diamond is RelaxedAccuracy, which evaluates the model's ability to pick the best answer out of 4 options: it scores 1 if the model's answer matches the reference answer and 0 otherwise. Although the GPQA instructions ask the model to choose one option (A, B, C, or D) and justify its choice, we found that the model sometimes names several options, declines to answer, or invents an answer matching none of the options. We therefore use two extraction approaches: (1) first-answer parsing, taking the first option letter (A, B, C, or D) that appears in the model's answer; (2) scoring by log-probability, computing the log-probability of each of the 4 options and choosing the most likely one. For the first approach we use a simple parser that takes the first letter appearing in the text; if the model names more than one option (for example, "Answer: A and C"), we take the first one ("A" in that example; see the sketch below). Self-reported.
23.7%
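A minimal sketch of the first-letter extraction rule described above; the regex is an illustrative assumption, and the log-probability fallback would require access to option-level scores.

```python
import re

def extract_choice(answer_text: str) -> str | None:
    """Return the first standalone option letter (A-D) in the answer.

    If the model names several options ("Answer: A and C"), the first
    one wins ("A"), matching the extraction rule described above."""
    match = re.search(r"\b([ABCD])\b", answer_text)
    return match.group(1) if match else None

print(extract_choice("Answer: A and C"))   # -> "A"
print(extract_choice("I cannot decide."))  # -> None (fall back to log-prob scoring)
```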

Other Tests

Specialized benchmarks
AIME 2025
0-shot accuracy. Self-reported.
11.6%
ARC-C
25-shot accuracy. One of the simplest efficiency scores for an LLM is its accuracy: we run the model on a set of n tasks, here with 25 in-context examples, and measure the overall percentage of correct answers. Although accuracy is straightforward to measure, note the following: (1) for some tasks, such as math problems, there is only one correct answer, and the LLM's response counts as correct only if it matches exactly; (2) for other tasks there may be several acceptable answers, and the response is judged on whether it contains all the key components; (3) when the LLM uses tools (for example, a calculation function), we score the final answer, not the intermediate steps, though a model that reasons well and applies tools correctly is more likely to reach the correct answer. Self-reported.
61.6%
ARC-E
0-shot accuracy. Self-reported.
81.6%
BoolQ
0-shot accuracy. Self-reported.
81.6%
Codegolf v2.2
0-shot pass@1. This metric measures the probability that the model produces a correct answer on its first attempt, without being shown examples of correct solutions. It maps directly onto how people use LLMs in practice: they ask a question and judge the first answer; if it is wrong they may rephrase or re-prompt, but that costs additional effort. The metric is a conservative estimate of model performance, since it does not credit multiple attempts or techniques that improve results (for example, chain-of-thought or sampling several answers). Self-reported.
16.8%
ECLeKTic
0-shot ECLeKTic score. ECLeKTic is a benchmark suite for probing language models' problem-solving abilities. It comprises 135 tasks spanning 9 types of reasoning, with tasks within each type arranged in increasing difficulty so that specific skills can be isolated. The ECLeKTic score ranges from 0 to 1, where 1 is a perfect score. Self-reported.
1.9%
Global-MMLU
0-shot accuracy. Self-reported.
60.3%
Global-MMLU-Lite
0-shot accuracy. 0-shot accuracy is the model's accuracy without any demonstrations or examples, and it matters because it measures the model's base ability to perform a task without additional context. While most real applications do provide context and examples, 0-shot accuracy indicates the model's underlying knowledge and capability. We distinguish several 0-shot variants (illustrated below): 0-shot-direct, where the model must answer directly with no additional instructions; 0-shot-direct-plus, where it must answer directly with some instructions but no examples; 0-shot-CoT, where it must show a chain of reasoning before answering, without examples of how to do so; and 0-shot-Program, where it must write a program to solve the task, without examples. Self-reported.
64.5%
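An illustrative sketch of the four 0-shot prompt variants listed above; the exact template wording is an assumption, only the variant names come from the description.

```python
# Hypothetical templates for the four 0-shot variants described above.
ZERO_SHOT_TEMPLATES = {
    "0-shot-direct": "{question}\nAnswer:",
    "0-shot-direct-plus": (
        "Answer with a single option letter only.\n{question}\nAnswer:"
    ),
    "0-shot-CoT": "{question}\nLet's think step by step.",
    "0-shot-Program": "{question}\nWrite a program that computes the answer.",
}

def render_prompt(variant: str, question: str) -> str:
    return ZERO_SHOT_TEMPLATES[variant].format(question=question)

print(render_prompt("0-shot-CoT", "What is 17 * 24?"))
```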
HiddenMath
0-shot accuracy. Self-reported.
37.7%
Include
0-shot accuracy. Self-reported.
57.2%
LiveCodeBench
0-shot pass@1. This score reflects the model's ability to solve tasks correctly on the first attempt without being given examples: the percentage of tasks solved with no prior prompts or demonstrations of how to perform the task. A high 0-shot pass@1 signals strong base ability to generalize knowledge and follow instructions using only the model's internal representations and existing knowledge. This is especially important when assessing reasoning, since it reflects genuine understanding rather than imitation of sample solutions. Under this protocol the model receives only the task, with no additional information about how to solve it and no examples of similar solutions. Self-reported.
13.2%
LiveCodeBench v5
0-shot pass@1. 0-shot means the model attempts the problem on its first try, without examples, hints, or prior training on the specific task; pass@1 means a correct solution on that first attempt. The score matters because it demonstrates the model's ability to handle new tasks without additional context; a high value usually indicates a model with solid understanding that can apply its knowledge to unfamiliar problems. Under this metric the model receives the task with no examples or context and must immediately produce a correct answer, unlike few-shot or fine-tuned settings where the model gets examples or task-specific training. Self-reported.
25.7%
MMLU-Pro
0-shot accuracy. Self-reported.
50.6%
MMLU-ProX
0-shot accuracy. Self-reported.
19.9%
Natural Questions
5-shot accuracy. We evaluate the model's 5-shot accuracy: 5 examples from the dataset are supplied as in-prompt context, and the model then answers the evaluation questions using that context. To control for the choice of examples, we repeat the process with 10 different draws of 5 examples and average the accuracy (see the sketch below). Self-reported.
20.9%
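A minimal sketch of the resampling procedure described above; `evaluate_5shot` is a hypothetical callable standing in for a full evaluation harness.

```python
import random
import statistics

def averaged_5shot_accuracy(example_pool, eval_set, evaluate_5shot,
                            repeats=10, seed=0):
    """Average 5-shot accuracy over `repeats` random draws of 5 examples,
    controlling for the choice of in-context examples.

    `evaluate_5shot(examples, eval_set)` is assumed to return the model's
    accuracy when prompted with those 5 examples."""
    rng = random.Random(seed)
    scores = [evaluate_5shot(rng.sample(example_pool, 5), eval_set)
              for _ in range(repeats)]
    return statistics.mean(scores)
```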
PIQA
0-shot accuracy. Self-reported.
81.0%
Social IQa
0-shot accuracy. Self-reported.
50.0%
TriviaQA
5-shot accuracy. 5-shot accuracy is a performance score in which the model sees five example question-answer pairs before answering the target question. The examples are typically of the same type as the target question and act as context that helps the model understand the task; 5-shot accuracy is the percentage of target questions the model then answers correctly. On many tasks 5-shot accuracy is significantly higher than 0-shot accuracy (where the model sees no examples), demonstrating the model's ability to adapt quickly to a new task from a handful of examples, without fine-tuning. Self-reported.
70.2%
WMT24++
ChrF, character-level F-measure. ChrF is a machine-translation quality metric that computes a character-level F-measure between the system output and the reference translation. It operates on character n-grams and therefore gives a finer-grained comparison than word-level metrics; this makes ChrF especially useful for morphologically rich languages, where word-level scoring can unfairly penalize translations that differ only in inflection. The metric first extracts character n-grams (typically up to order 6) from the candidate and reference texts, then measures their overlap via precision and recall combined into an F-measure (a simplified sketch follows below). Self-reported.
50.1%
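A simplified chrF-style sketch, assuming plain character n-grams up to order 6 and an unweighted F1; the standard chrF (as in sacreBLEU) weights recall more heavily (beta=2) and handles whitespace differently.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate: str, reference: str, max_order: int = 6) -> float:
    """Average character n-gram F1 over orders 1..max_order."""
    f_scores = []
    for n in range(1, max_order + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue
        overlap = sum((cand & ref).values())
        if overlap == 0:
            f_scores.append(0.0)
            continue
        p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
        f_scores.append(2 * p * r / (p + r))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("the cat sat", "the cat sat on the mat"))  # partial overlap < 1.0
```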

License & Metadata

License
gemma
Announcement Date
May 20, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.