
Gemma 3 4B

Multimodal
Google

Gemma 3 4B is a multimodal language model from Google with 4 billion parameters that accepts text and image inputs and generates text responses. The model has a 128K-token context window, supports multiple languages, and ships with open weights. It is suited to question answering, summarization, logical reasoning, and image-understanding tasks.

Key Specifications

Parameters
4.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
53.0%

Timeline

Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
4.0B
Training Tokens
4.0T tokens
Knowledge Cutoff
August 1, 2024
Family
-
Capabilities
Multimodal · ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.02
Output (per 1M tokens)
$0.04
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
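The listed per-1M-token rates translate directly into per-request costs. A minimal sketch, assuming only the prices from the table above (the helper function name is ours, for illustration):

```python
# Estimate request cost from the per-1M-token rates listed above
# ($0.02 input, $0.04 output for Gemma 3 4B).

INPUT_PRICE_PER_M = 0.02   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.04  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 10K-token prompt with a 1K-token completion
cost = estimate_cost(10_000, 1_000)
print(f"${cost:.6f}")  # prints $0.000240
```

Even a maximal 131.1K-token input costs well under a cent at these rates, which is typical for small open-weight models served via API.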

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
0-shot evaluation · Self-reported
71.3%
MBPP
3-shot evaluation · Self-reported
63.2%

Mathematics

Mathematical problems and computations
GSM8k
0-shot evaluation · Self-reported
89.2%
MATH
0-shot evaluation · Self-reported
75.6%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation · Self-reported
72.2%
GPQA
0-shot evaluation on the Diamond subset, which measures the model's ability to reason about novel questions it did not see during training rather than recall memorized patterns · Self-reported
30.8%

Multimodal

Working with images and visual data
AI2D
Multimodal evaluation · Self-reported
74.8%
ChartQA
Multimodal evaluation · Self-reported
68.8%
DocVQA
Multimodal evaluation · Self-reported
75.8%

Other Tests

Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation · Self-reported
11.0%
Bird-SQL (dev)
Self-reported
36.3%
ECLeKTic
0-shot evaluation · Self-reported
4.6%
FACTS Grounding
Self-reported
70.1%
Global-MMLU-Lite
0-shot evaluation · Self-reported
54.5%
HiddenMath
0-shot evaluation · Self-reported
43.0%
IFEval
0-shot evaluation · Self-reported
90.2%
InfoVQA
Multimodal evaluation · Self-reported
50.0%
LiveCodeBench
0-shot evaluation: the question is posed without any examples or hints, reflecting typical real-world use · Self-reported
12.6%
MathVista-Mini
Multimodal evaluation · Self-reported
50.0%
MMLU-Pro
0-shot evaluation: the model generates its answer from the question alone, without any additional prompts or examples. Unlike n-shot evaluation, where question-answer examples precede the task, 0-shot measures the model's ability to answer using only knowledge acquired during pretraining · Self-reported
43.6%
MMMU (val)
Multimodal evaluation · Self-reported
48.8%
Natural2Code
0-shot evaluation · Self-reported
70.3%
SimpleQA
0-shot evaluation · Self-reported
4.0%
TextVQA
Multimodal evaluation · Self-reported
57.8%
VQAv2 (val)
Multimodal evaluation · Self-reported
62.4%
WMT24++
0-shot evaluation (no explicit examples) · Self-reported
46.8%

License & Metadata

License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.