
Gemma 3 1B

Google

Gemma 3 1B is a lightweight language model from Google with one billion parameters, optimized to run efficiently on resource-constrained devices. At 529 MB, it processes text at 2,585 tokens per second and offers a 128,000-token context window. The model supports over 35 languages but handles text only, unlike the larger multimodal Gemma models. This balance of speed and efficiency makes it well suited to fast text processing on mobile and low-power devices.
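As a rough illustration of running the model for plain-text generation, the sketch below loads it with Hugging Face transformers. The checkpoint id google/gemma-3-1b-it and the generation settings are assumptions for the example, not details taken from this page.

# Minimal sketch: text generation with Gemma 3 1B via transformers.
# The checkpoint id below is an assumption; adjust to the variant you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize in one sentence: Gemma 3 1B is a lightweight text-only model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens and print only the newly generated text.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))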

Key Specifications

Parameters
1.0B
Context
-
Release Date
March 12, 2025
Average Score
29.9%

Timeline

Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
1.0B
Training Tokens
2.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
0-shot evaluation · Self-reported
41.5%
MBPP
3-shot evaluation · Self-reported
35.2%
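The "0-shot" and "3-shot" labels above describe how many worked examples are placed in the prompt before the task. A minimal sketch of the difference follows; the task and solution strings are invented for illustration and are not drawn from the HumanEval or MBPP harnesses.

# Sketch of 0-shot vs. few-shot prompt construction (illustrative tasks only).
def build_prompt(task, examples=()):
    # 0-shot when `examples` is empty; k-shot when k solved examples are prepended.
    parts = [f"Task: {ex_task}\nSolution: {ex_solution}\n" for ex_task, ex_solution in examples]
    parts.append(f"Task: {task}\nSolution:")
    return "\n".join(parts)

zero_shot = build_prompt("Write a function that reverses a string.")
three_shot = build_prompt(
    "Write a function that reverses a string.",
    examples=[
        ("Add two numbers.", "def add(a, b): return a + b"),
        ("Square a number.", "def square(x): return x * x"),
        ("Negate a boolean.", "def negate(b): return not b"),
    ],
)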

Mathematics

Mathematical problems and computations
GSM8k
0-shot evaluation · Self-reported
62.8%
MATH
0-shot evaluation · Self-reported
48.0%
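Math benchmarks of this kind are usually scored by comparing only the model's final answer against the reference. The check below is a rough sketch of that idea; the regex and normalization are assumptions, not the official GSM8k or MATH grading scripts.

import re

def extract_final_number(text):
    # Assumed heuristic: take the last number in the response as the final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(response, reference):
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - float(reference)) < 1e-6

print(is_correct("Each box holds 12 eggs, so 4 boxes hold 48 eggs.", "48"))  # True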

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation · Self-reported
39.1%
GPQA
0-shot evaluation (diamond) · Self-reported
19.2%

Other Tests

Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation · Self-reported
7.2%
Bird-SQL (dev)
Self-reported
6.4%
ECLeKTic
0-shot evaluation · Self-reported
1.4%
FACTS Grounding
Self-reported
36.4%
Global-MMLU-Lite
0-shot evaluation · Self-reported
34.2%
HiddenMath
0-shot evaluation · Self-reported
15.8%
IFEval
0-shot evaluation · Self-reported
80.2%
LiveCodeBench
0-shot evaluation · Self-reported
1.9%
MMLU-Pro
0-shot evaluation · Self-reported
14.7%
Natural2Code
0-shot evaluation · Self-reported
56.0%
SimpleQA
0-shot evaluation · Self-reported
2.2%
WMT24++
0-shot evaluation · Self-reported
35.9%

License & Metadata

License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.