Gemma 3 4B
Gemma 3 4B is a multimodal language model from Google with 4 billion parameters that accepts text and visual inputs and generates text responses. The model has a 128K-token context window, supports multiple languages, and is released with open weights. It is suited to question answering, summarization, logical reasoning, and image-understanding tasks.
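As a quick orientation, here is a minimal sketch of querying the model with an image and a text prompt through the Hugging Face transformers pipeline. The checkpoint name google/gemma-3-4b-it and the chat-message layout are assumptions based on common transformers conventions, not an official recipe from this page.

```python
# Minimal sketch: image + text in, text out.
# Assumes a transformers version with Gemma 3 support and the
# instruction-tuned "google/gemma-3-4b-it" checkpoint (both assumptions).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "Summarize what this chart shows."},
    ],
}]

result = pipe(text=messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```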
Key Specifications
Parameters
4.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
53.0%
Timeline
Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
4.0B
Training Tokens
4.0T tokens
Knowledge Cutoff
August 1, 2024
Family
Gemma 3
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.02
Output (per 1M tokens)
$0.04
Max Input Tokens
131.1K
Max Output Tokens
131.1K
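At the listed rates, per-request cost is straightforward arithmetic; a small helper, assuming only the input/output prices shown above:

```python
# Cost estimate from the listed rates:
# $0.02 per 1M input tokens, $0.04 per 1M output tokens.
INPUT_RATE = 0.02 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.04 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 100K-token prompt with a 2K-token reply is about $0.00208.
print(f"${estimate_cost(100_000, 2_000):.5f}")
```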
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
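Of these, function calling is worth a concrete picture: the application declares a tool schema, the model replies with a structured tool call, and the application executes it and feeds the result back. A generic sketch of that loop, with hypothetical names (get_weather, handle_tool_call) rather than any actual Gemma SDK surface:

```python
# Generic function-calling flow: schema out, JSON tool call back in.
# Tool and helper names here are hypothetical, not a real Gemma API.
import json

TOOLS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def handle_tool_call(raw_reply: str) -> str:
    """Parse a model reply shaped like {"name": ..., "arguments": {...}}."""
    call = json.loads(raw_reply)
    if call["name"] == "get_weather":
        return f"Sunny, 22 C in {call['arguments']['city']}"  # stubbed result
    raise ValueError(f"unknown tool: {call['name']}")

# A model that supports function calling would emit something like this:
print(handle_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```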
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
HumanEval
0-shot evaluation • Self-reported
MBPP
3-shot evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
0-shot evaluation • Self-reported
MATH
0-shot evaluation • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation • Self-reported
GPQA
0-shot evaluation (Diamond subset) • Self-reported
Multimodal
Working with images and visual data
AI2D
Multimodal evaluation • Self-reported
ChartQA
Multimodal evaluation • Self-reported
DocVQA
Multimodal evaluation • Self-reported
Other Tests
Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation • Self-reported
Bird-SQL (dev)
Self-reported
ECLeKTic
0-shot evaluation • Self-reported
FACTS Grounding
Self-reported
Global-MMLU-Lite
0-shot evaluation • Self-reported
HiddenMath
0-shot evaluation • Self-reported
IFEval
0-shot evaluation • Self-reported
InfoVQA
Multimodal evaluation • Self-reported
LiveCodeBench
0-shot evaluation • Self-reported
MathVista-Mini
Multimodal evaluation • Self-reported
MMLU-Pro
0-shot evaluation • Self-reported
MMMU (val)
Multimodal evaluation • Self-reported
Natural2Code
0-shot evaluation • Self-reported
SimpleQA
0-shot evaluation • Self-reported
TextVQA
Multimodal evaluation • Self-reported
VQAv2 (val)
Multimodal evaluation • Self-reported
WMT24++
0-shot evaluation • Self-reported
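Most entries above are 0-shot: the prompt contains only the task, with no solved examples (MBPP's 3-shot run instead prepends three worked examples). A minimal sketch of such an exact-match harness, with query_model standing in for any model client:

```python
# Sketch of a 0-shot exact-match evaluation loop.
# query_model is a hypothetical stand-in for a real model client.
from typing import Callable, Iterable, Tuple

def evaluate_0shot(
    items: Iterable[Tuple[str, str]],     # (prompt, gold answer) pairs
    query_model: Callable[[str], str],    # prompt -> model answer
) -> float:
    """Accuracy by exact match; no in-context examples are added."""
    pairs = list(items)
    correct = sum(query_model(p).strip() == gold.strip() for p, gold in pairs)
    return correct / len(pairs)

# A k-shot variant would prepend k solved examples to each prompt;
# the scoring itself is unchanged.
demo = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
print(evaluate_0shot(demo, lambda p: "4" if "2 + 2" in p else "Paris"))  # 1.0
```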
License & Metadata
License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025
Similar Models
Gemma 3n E2B
Multimodal • 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
Gemma 3n E2B Instructed
Multimodal • 8.0B
Best score: 0.7 (HumanEval)
Released: Jun 2025
Gemma 3n E4B Instructed LiteRT Preview
Multimodal • 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Gemma 3n E2B Instructed LiteRT (Preview)
Multimodal • 1.9B
Best score: 0.7 (HumanEval)
Released: May 2025
Gemma 3n E4B Instructed
Multimodal • 8.0B
Best score: 0.8 (HumanEval)
Released: Jun 2025
Price: $20.00/1M tokens
Gemini 1.5 Flash 8B
Multimodal • 8.0B
Best score: 0.4 (GPQA)
Released: Mar 2024
Price: $0.07/1M tokens
MedGemma 4B IT
Multimodal • 4.3B
Released: May 2025
Gemma 3n E4B
Multimodal • 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.