
Gemma 3n E4B Instructed LiteRT Preview

Multimodal
Google

Gemma 3n is a generative AI model optimized for everyday devices such as phones, laptops, and tablets. The model incorporates innovations like Per-Layer Embedding (PLE) parameter caching and the MatFormer model architecture to reduce computational and memory requirements. These models process audio, text, and visual data, though this E4B preview currently supports text and visual input. Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same research and technology used to create Gemini models, and licensed for responsible commercial use.

Key Specifications

Parameters
1.9B
Context
-
Release Date
May 20, 2025
Average Score
50.3%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
1.9B
Training Tokens
-
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
10-shot accuracy. 0-shot and 1-shot setups often do not fully expose a language model's reasoning ability, so this evaluation measures how performance improves as the number of in-prompt examples grows: on reasoning-heavy tasks, accuracy improves when 10 examples are given instead of 0 or 1. Advantages of 10-shot evaluation: (1) the model is not penalized for being unfamiliar with the task format or question type; (2) it works for models reachable only through an API; (3) it measures reasoning performance more precisely. Method: the model is shown 10 question-answer pairs before answering the target question. Self-reported.
78.6%
MMLU
0-shot accuracy. Self-reported.
64.9%
Winogrande
5-shot accuracy. We compute the model's accuracy using 5-shot prompts, where the in-context examples are other question-answer pairs drawn from the same benchmark dataset. 5-shot evaluation gives a more reliable picture of the model than 0-shot, since it reduces errors caused by the model misinterpreting the instructions. Self-reported.
71.7%

Programming

Programming skills tests
HumanEval
0-shot pass@1. Several multiple-choice benchmarks, such as MMLU, give the model a question and several answer options to choose from; this benchmark instead counts the tasks the model solves correctly on its first attempt. The "pass on first attempt" (pass@1) score is especially useful because it mirrors how people actually use the model. In other cases, especially tasks that require deliberate thinking and reasoning, it may understate the model's capability: a person might produce a first solution, notice an error, and correct it, and pass@1 does not account for such corrections. Here we evaluate how often the model reaches the correct answer with no examples (0-shot) and on the first attempt, which gives a fuller picture of its reliability, especially where reasoning is required (see the sketch below). Self-reported.
75.0%
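As a minimal sketch of how this kind of score is computed, assuming the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) rather than this card's exact harness: with one sample per task, pass@1 reduces to the fraction of tasks whose first completion passes the unit tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per task, c of them correct.

    Probability that at least one of k samples (chosen from the n
    drawn) passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per task, pass@1 is just the pass rate:
first_attempts = [True, False, True, True]        # did each task's first completion pass?
print(sum(first_attempts) / len(first_attempts))  # 0.75
print(pass_at_k(n=1, c=1, k=1))                   # 1.0 for a task whose only sample passed
```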
MBPP
3-shot pass@1. The model is shown three example tasks with solutions (3-shot) and must then apply the same reasoning to new tasks. This tests in-context learning (can the model pick up patterns from a few examples?), generalization (can it apply the approach to a new task?), and consistency of reasoning. We measure pass@1, correctness on the first attempt, meaning the model gets exactly one chance per task; this strict setup does not let the model retry or repair errors. The score shows how well a model can pick up a new task "on the fly" from a small number of examples, which matters for applications where users can provide only a few samples when teaching the system a new task. Self-reported.
63.6%

Mathematics

Mathematical problems and computations
MGSM
0-shot accuracy. 0-shot accuracy is the percentage of tasks answered correctly without any in-context (training) examples. In this setting the model must solve the task using only the knowledge acquired during training plus the instructions in the prompt. 0-shot accuracy is one of the key measures for large language models, since it shows whether they can perform new tasks without task-specific training: the model must generate the correct answer on the first attempt, without any hints or examples. Self-reported.
60.7%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Few-shot accuracy. Few-shot evaluation measures a model's ability to solve tasks given a small number of worked examples. Unlike zero-shot testing, where the model must perform the task with no prior examples, few-shot testing supplies several example tasks with their solutions before asking the model to solve a new one. This is especially important for assessing in-context learning and adaptation, and it mirrors real deployments where users demonstrate the desired behavior a few times before the model performs the task on its own. The evaluation: (1) show the model k example tasks with solutions (k is typically 1 to 5); (2) pose a new task of the same type; (3) score the model's answer (see the prompt-assembly sketch below). Scoring varies by task type, from exact match for single-answer tasks to softer scores for tasks with several valid answers. Self-reported.
52.9%
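A minimal sketch of the k-shot prompt assembly described above; the `Q:`/`A:` formatting and the helper name are illustrative assumptions, not the exact harness behind this score.

```python
def build_few_shot_prompt(examples, target_question, k=3):
    """Assemble a k-shot prompt: k worked examples, then the new task.

    `examples` is a list of (question, answer) pairs from the same
    benchmark; the model is asked to complete the final answer."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)

demos = [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9"), ("10 - 7 = ?", "3")]
print(build_few_shot_prompt(demos, "6 / 2 = ?", k=3))
```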
DROP
1-shot token F1. Token-level F1 measures the model's answer accuracy at the token level in a 1-shot setting: the model's answer is compared with the reference answer token by token, after the model has seen one example demonstrating the correct answer format. Token-level F1 is especially useful where exact matches matter, such as extracting factual information or precise spans, because it balances recall (how many reference tokens the model produced) against precision (how many produced tokens are correct). The 1-shot setup probes the model's ability to learn the format from a single example, which matters in real use where little demonstration data is available (a minimal F1 sketch follows below). Self-reported.
60.8%
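A minimal sketch of token-level F1, assuming SQuAD/DROP-style whitespace tokenization and lowercasing; the official DROP evaluator adds further normalization (punctuation and article stripping, number handling).

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of precision and recall over the multiset of
    lowercased, whitespace-split answer tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("born in 1809 in Kentucky", "1809"))  # partial credit: ~0.33
```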
GPQA
Diamond subset, 0-shot RelaxedAccuracy. The score on GPQA Diamond is RelaxedAccuracy, which evaluates the model's ability to pick the best answer out of 4 options: it scores 1 if the model's answer matches the reference answer and 0 otherwise. Although the GPQA instructions ask the model to choose one option (A, B, C, or D) and justify its choice, we found that the model sometimes names several options, declines to answer, or invents an answer matching none of the options. We therefore use two extraction approaches: (1) first-answer parsing, taking the first option letter (A, B, C, or D) that appears in the model's answer; (2) scoring by log-probability, computing the log-probability of each of the 4 options and choosing the most likely one. For the first approach we use a simple parser that takes the first letter appearing in the text; if the model names more than one option (for example, "Answer: A and C"), we take the first one ("A" in that example; see the sketch below). Self-reported.
23.7%
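A minimal sketch of the first-letter extraction rule described above; the regex is an illustrative assumption, and the log-probability fallback would require access to option-level scores.

```python
import re

def extract_choice(answer_text: str) -> str | None:
    """Return the first standalone option letter (A-D) in the answer.

    If the model names several options ("Answer: A and C"), the first
    one wins ("A"), matching the extraction rule described above."""
    match = re.search(r"\b([ABCD])\b", answer_text)
    return match.group(1) if match else None

print(extract_choice("Answer: A and C"))   # -> "A"
print(extract_choice("I cannot decide."))  # -> None (fall back to log-prob scoring)
```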

Other Tests

Specialized benchmarks
AIME 2025
0-shot accuracy. Self-reported.
11.6%
ARC-C
25-shot accuracy. One of the simplest efficiency scores for an LLM is its accuracy: we run the model on a set of n tasks, here with 25 in-context examples, and measure the overall percentage of correct answers. Although accuracy is straightforward to measure, note the following: (1) for some tasks, such as math problems, there is only one correct answer, and the LLM's response counts as correct only if it matches exactly; (2) for other tasks there may be several acceptable answers, and the response is judged on whether it contains all the key components; (3) when the LLM uses tools (for example, a calculation function), we score the final answer, not the intermediate steps, though a model that reasons well and applies tools correctly is more likely to reach the correct answer. Self-reported.
61.6%
ARC-E
0-shot accuracy. Self-reported.
81.6%
BoolQ
0-shot accuracy. Self-reported.
81.6%
Codegolf v2.2
0-shot pass@1. This metric measures the probability that the model produces a correct answer on its first attempt, without being shown examples of correct solutions. It maps directly onto how people use LLMs in practice: they ask a question and judge the first answer; if it is wrong they may rephrase or re-prompt, but that costs additional effort. The metric is a conservative estimate of model performance, since it does not credit multiple attempts or techniques that improve results (for example, chain-of-thought or sampling several answers). Self-reported.
16.8%
ECLeKTic
0-shot ECLeKTic score. ECLeKTic is a benchmark suite for probing language models' problem-solving abilities. It comprises 135 tasks spanning 9 types of reasoning, with tasks within each type arranged in increasing difficulty so that specific skills can be isolated. The ECLeKTic score ranges from 0 to 1, where 1 is a perfect score. Self-reported.
1.9%
Global-MMLU
0-shot accuracy. Self-reported.
60.3%
Global-MMLU-Lite
0-shot accuracy. 0-shot accuracy is the model's accuracy without any demonstrations or examples, and it matters because it measures the model's base ability to perform a task without additional context. While most real applications do provide context and examples, 0-shot accuracy indicates the model's underlying knowledge and capability. We distinguish several 0-shot variants (illustrated below): 0-shot-direct, where the model must answer directly with no additional instructions; 0-shot-direct-plus, where it must answer directly with some instructions but no examples; 0-shot-CoT, where it must show a chain of reasoning before answering, without examples of how to do so; and 0-shot-Program, where it must write a program to solve the task, without examples. Self-reported.
64.5%
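An illustrative sketch of the four 0-shot prompt variants listed above; the exact template wording is an assumption, only the variant names come from the description.

```python
# Hypothetical templates for the four 0-shot variants described above.
ZERO_SHOT_TEMPLATES = {
    "0-shot-direct": "{question}\nAnswer:",
    "0-shot-direct-plus": (
        "Answer with a single option letter only.\n{question}\nAnswer:"
    ),
    "0-shot-CoT": "{question}\nLet's think step by step.",
    "0-shot-Program": "{question}\nWrite a program that computes the answer.",
}

def render_prompt(variant: str, question: str) -> str:
    return ZERO_SHOT_TEMPLATES[variant].format(question=question)

print(render_prompt("0-shot-CoT", "What is 17 * 24?"))
```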
HiddenMath
0-shot accuracy. Self-reported.
37.7%
Include
0-shot accuracy. Self-reported.
57.2%
LiveCodeBench
0-shot pass@1. This score reflects the model's ability to solve tasks correctly on the first attempt without being given examples: the percentage of tasks solved with no prior prompts or demonstrations of how to perform the task. A high 0-shot pass@1 signals strong base ability to generalize knowledge and follow instructions using only the model's internal representations and existing knowledge. This is especially important when assessing reasoning, since it reflects genuine understanding rather than imitation of sample solutions. Under this protocol the model receives only the task, with no additional information about how to solve it and no examples of similar solutions. Self-reported.
13.2%
LiveCodeBench v5
0-shot pass@1. 0-shot means the model attempts the problem on its first try, without examples, hints, or prior training on the specific task; pass@1 means a correct solution on that first attempt. The score matters because it demonstrates the model's ability to handle new tasks without additional context; a high value usually indicates a model with solid understanding that can apply its knowledge to unfamiliar problems. Under this metric the model receives the task with no examples or context and must immediately produce a correct answer, unlike few-shot or fine-tuned settings where the model gets examples or task-specific training. Self-reported.
25.7%
MMLU-Pro
0-shot accuracy. Self-reported.
50.6%
MMLU-ProX
0-shot accuracy. Self-reported.
19.9%
Natural Questions
5-shot accuracy. We evaluate the model's 5-shot accuracy: 5 examples from the dataset are supplied as in-prompt context, and the model then answers the evaluation questions using that context. To control for the choice of examples, we repeat the process with 10 different draws of 5 examples and average the accuracy (see the sketch below). Self-reported.
20.9%
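A minimal sketch of the resampling procedure described above; `evaluate_5shot` is a hypothetical callable standing in for a full evaluation harness.

```python
import random
import statistics

def averaged_5shot_accuracy(example_pool, eval_set, evaluate_5shot,
                            repeats=10, seed=0):
    """Average 5-shot accuracy over `repeats` random draws of 5 examples,
    controlling for the choice of in-context examples.

    `evaluate_5shot(examples, eval_set)` is assumed to return the model's
    accuracy when prompted with those 5 examples."""
    rng = random.Random(seed)
    scores = [evaluate_5shot(rng.sample(example_pool, 5), eval_set)
              for _ in range(repeats)]
    return statistics.mean(scores)
```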
PIQA
0-shot accuracy. Self-reported.
81.0%
Social IQa
0-shot accuracy. Self-reported.
50.0%
TriviaQA
5-shot accuracy. 5-shot accuracy is a performance score in which the model sees five example question-answer pairs before answering the target question. The examples are typically of the same type as the target question and act as context that helps the model understand the task; 5-shot accuracy is the percentage of target questions the model then answers correctly. On many tasks 5-shot accuracy is significantly higher than 0-shot accuracy (where the model sees no examples), demonstrating the model's ability to adapt quickly to a new task from a handful of examples, without fine-tuning. Self-reported.
70.2%
WMT24++
ChrF, character-level F-measure. ChrF is a machine-translation quality metric that computes a character-level F-measure between the system output and the reference translation. It operates on character n-grams and therefore gives a finer-grained comparison than word-level metrics; this makes ChrF especially useful for morphologically rich languages, where word-level scoring can unfairly penalize translations that differ only in inflection. The metric first extracts character n-grams (typically up to order 6) from the candidate and reference texts, then measures their overlap via precision and recall combined into an F-measure (a simplified sketch follows below). Self-reported.
50.1%
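A simplified chrF-style sketch, assuming plain character n-grams up to order 6 and an unweighted F1; the standard chrF (as in sacreBLEU) weights recall more heavily (beta=2) and handles whitespace differently.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate: str, reference: str, max_order: int = 6) -> float:
    """Average character n-gram F1 over orders 1..max_order."""
    f_scores = []
    for n in range(1, max_order + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue
        overlap = sum((cand & ref).values())
        if overlap == 0:
            f_scores.append(0.0)
            continue
        p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
        f_scores.append(2 * p * r / (p + r))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("the cat sat", "the cat sat on the mat"))  # partial overlap < 1.0
```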

License & Metadata

License
gemma
Announcement Date
May 20, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.