
Gemma 3 1B

Google

Gemma 3 1B is a lightweight language model from Google with one billion parameters, optimized to run efficiently on resource-constrained devices. At 529 MB, it processes text at 2,585 tokens per second and offers a 128,000-token context window. The model supports over 35 languages but handles text only, unlike the larger multimodal Gemma models. This balance of speed and efficiency makes it well suited to fast text processing on mobile and low-power devices.
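As a rough illustration of running the model for plain-text generation, the sketch below loads it with Hugging Face transformers. The checkpoint id google/gemma-3-1b-it and the generation settings are assumptions for the example, not details taken from this page.

# Minimal sketch: text generation with Gemma 3 1B via transformers.
# The checkpoint id below is an assumption; adjust to the variant you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize in one sentence: Gemma 3 1B is a lightweight text-only model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens and print only the newly generated text.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))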

Key Specifications

Parameters
1.0B
Context
-
Release Date
March 12, 2025
Average Score
29.9%

Timeline

Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
1.0B
Training Tokens
2.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
0-shot evaluation · Self-reported
41.5%
MBPP
3-shot evaluation · Self-reported
35.2%
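The "0-shot" and "3-shot" labels above describe how many worked examples are placed in the prompt before the task. A minimal sketch of the difference follows; the task and solution strings are invented for illustration and are not drawn from the HumanEval or MBPP harnesses.

# Sketch of 0-shot vs. few-shot prompt construction (illustrative tasks only).
def build_prompt(task, examples=()):
    # 0-shot when `examples` is empty; k-shot when k solved examples are prepended.
    parts = [f"Task: {ex_task}\nSolution: {ex_solution}\n" for ex_task, ex_solution in examples]
    parts.append(f"Task: {task}\nSolution:")
    return "\n".join(parts)

zero_shot = build_prompt("Write a function that reverses a string.")
three_shot = build_prompt(
    "Write a function that reverses a string.",
    examples=[
        ("Add two numbers.", "def add(a, b): return a + b"),
        ("Square a number.", "def square(x): return x * x"),
        ("Negate a boolean.", "def negate(b): return not b"),
    ],
)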

Mathematics

Mathematical problems and computations
GSM8k
0-shot evaluation · Self-reported
62.8%
MATH
0-shot evaluation · Self-reported
48.0%
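Math benchmarks of this kind are usually scored by comparing only the model's final answer against the reference. The check below is a rough sketch of that idea; the regex and normalization are assumptions, not the official GSM8k or MATH grading scripts.

import re

def extract_final_number(text):
    # Assumed heuristic: take the last number in the response as the final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(response, reference):
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - float(reference)) < 1e-6

print(is_correct("Each box holds 12 eggs, so 4 boxes hold 48 eggs.", "48"))  # True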

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation · Self-reported
39.1%
GPQA
0-shot evaluation (diamond) · Self-reported
19.2%

Other Tests

Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation · Self-reported
7.2%
Bird-SQL (dev)
Self-reported
6.4%
ECLeKTic
0-shot evaluation · Self-reported
1.4%
FACTS Grounding
Self-reported
36.4%
Global-MMLU-Lite
0-shot evaluation · Self-reported
34.2%
HiddenMath
0-shot evaluation · Self-reported
15.8%
IFEval
0-shot evaluation · Self-reported
80.2%
LiveCodeBench
0-shot evaluation · Self-reported
1.9%
MMLU-Pro
0-shot evaluation · Self-reported
14.7%
Natural2Code
0-shot evaluation · Self-reported
56.0%
SimpleQA
0-shot evaluation · Self-reported
2.2%
WMT24++
0-shot evaluation · Self-reported
35.9%

License & Metadata

License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.