Gemma 3 4B
Gemma 3 4B is a multimodal language model from Google with 4 billion parameters that accepts text and visual inputs and generates text responses. The model has a 128K-token context window, supports multiple languages, and is released with open weights. It is suited to question answering, summarization, logical reasoning, and image-understanding tasks.
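As a quick orientation, here is a minimal sketch of querying the model with an image and a text prompt through the Hugging Face transformers pipeline. The checkpoint name google/gemma-3-4b-it and the chat-message layout are assumptions based on common transformers conventions, not an official recipe from this page.

```python
# Minimal sketch: image + text in, text out.
# Assumes a transformers version with Gemma 3 support and the
# instruction-tuned "google/gemma-3-4b-it" checkpoint (both assumptions).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "Summarize what this chart shows."},
    ],
}]

result = pipe(text=messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```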
Key Specifications
Parameters
4.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
53.0%
Timeline
Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
4.0B
Training Tokens
4.0T tokens
Knowledge Cutoff
August 1, 2024
Family
Gemma 3
Capabilities
Multimodal • ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.02
Output (per 1M tokens)
$0.04
Max Input Tokens
131.1K
Max Output Tokens
131.1K
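At the listed rates, per-request cost is straightforward arithmetic; a small helper, assuming only the input/output prices shown above:

```python
# Cost estimate from the listed rates:
# $0.02 per 1M input tokens, $0.04 per 1M output tokens.
INPUT_RATE = 0.02 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.04 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 100K-token prompt with a 2K-token reply is about $0.00208.
print(f"${estimate_cost(100_000, 2_000):.5f}")
```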
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
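Of these, function calling is worth a concrete picture: the application declares a tool schema, the model replies with a structured tool call, and the application executes it and feeds the result back. A generic sketch of that loop, with hypothetical names (get_weather, handle_tool_call) rather than any actual Gemma SDK surface:

```python
# Generic function-calling flow: schema out, JSON tool call back in.
# Tool and helper names here are hypothetical, not a real Gemma API.
import json

TOOLS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def handle_tool_call(raw_reply: str) -> str:
    """Parse a model reply shaped like {"name": ..., "arguments": {...}}."""
    call = json.loads(raw_reply)
    if call["name"] == "get_weather":
        return f"Sunny, 22 C in {call['arguments']['city']}"  # stubbed result
    raise ValueError(f"unknown tool: {call['name']}")

# A model that supports function calling would emit something like this:
print(handle_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```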
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
HumanEval
0-shot evaluation • Self-reported
MBPP
3-shot evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
0-shot evaluation • Self-reported
MATH
0-shot evaluation • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation • Self-reported
GPQA
0-shot evaluation (Diamond subset) • Self-reported
Multimodal
Working with images and visual data
AI2D
Multimodal evaluation • Self-reported
ChartQA
Multimodal evaluation • Self-reported
DocVQA
Multimodal evaluation • Self-reported
Other Tests
Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation • Self-reported
Bird-SQL (dev)
Self-reported
ECLeKTic
0-shot evaluation • Self-reported
FACTS Grounding
Self-reported
Global-MMLU-Lite
0-shot evaluation • Self-reported
HiddenMath
0-shot evaluation • Self-reported
IFEval
0-shot evaluation • Self-reported
InfoVQA
Multimodal evaluation • Self-reported
LiveCodeBench
0-shot evaluation • Self-reported
MathVista-Mini
Multimodal evaluation • Self-reported
MMLU-Pro
0-shot evaluation • Self-reported
MMMU (val)
Multimodal evaluation • Self-reported
Natural2Code
0-shot evaluation • Self-reported
SimpleQA
0-shot evaluation • Self-reported
TextVQA
Multimodal evaluation • Self-reported
VQAv2 (val)
Multimodal evaluation • Self-reported
WMT24++
0-shot evaluation • Self-reported
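Most entries above are 0-shot: the prompt contains only the task, with no solved examples (MBPP's 3-shot run instead prepends three worked examples). A minimal sketch of such an exact-match harness, with query_model standing in for any model client:

```python
# Sketch of a 0-shot exact-match evaluation loop.
# query_model is a hypothetical stand-in for a real model client.
from typing import Callable, Iterable, Tuple

def evaluate_0shot(
    items: Iterable[Tuple[str, str]],     # (prompt, gold answer) pairs
    query_model: Callable[[str], str],    # prompt -> model answer
) -> float:
    """Accuracy by exact match; no in-context examples are added."""
    pairs = list(items)
    correct = sum(query_model(p).strip() == gold.strip() for p, gold in pairs)
    return correct / len(pairs)

# A k-shot variant would prepend k solved examples to each prompt;
# the scoring itself is unchanged.
demo = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
print(evaluate_0shot(demo, lambda p: "4" if "2 + 2" in p else "Paris"))  # 1.0
```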
License & Metadata
License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025
Similar Models
Gemma 3n E2B
Multimodal • 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
Gemma 3n E2B Instructed
Multimodal • 8.0B
Best score: 0.7 (HumanEval)
Released: Jun 2025
Gemma 3n E4B Instructed LiteRT Preview
Multimodal • 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Gemma 3n E2B Instructed LiteRT (Preview)
Multimodal • 1.9B
Best score: 0.7 (HumanEval)
Released: May 2025
Gemma 3n E4B Instructed
Multimodal • 8.0B
Best score: 0.8 (HumanEval)
Released: Jun 2025
Price: $20.00/1M tokens
Gemini 1.5 Flash 8B
Multimodal • 8.0B
Best score: 0.4 (GPQA)
Released: Mar 2024
Price: $0.07/1M tokens
MedGemma 4B IT
Multimodal • 4.3B
Released: May 2025
Gemma 3n E4B
Multimodal • 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.