
Gemma 3 12B

Multimodal
Google

Gemma 3 12B is a vision-language model from Google with 12 billion parameters. It processes text and visual input and generates text output. The model offers a 128K-token context window, multi-language support, and open weights, and is suited to question answering, summarization, reasoning, and image-understanding tasks.

Key Specifications

Parameters
12.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
62.5%

Timeline

Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
12.0B
Training Tokens
12.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
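The per-token prices above translate directly into per-request costs. A minimal sketch, assuming the listed rates of $0.05 per 1M input tokens and $0.10 per 1M output tokens (function name and example token counts are illustrative, not from the page):

```python
# Cost estimator using the per-1M-token prices listed above.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 0.10  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 100K-token prompt with a 2K-token completion.
print(round(request_cost(100_000, 2_000), 6))  # 0.0052
```

Even a near-maximal 131.1K-token prompt costs well under a cent at these rates, which is typical for small open-weight models served by third parties.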

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
0-shot evaluation: the model generates a Python function from a prompt with no worked examples, and solutions are checked for functional correctness. Self-reported
85.4%
MBPP
3-shot evaluation: the model is shown three worked examples, each a task paired with its correct solution, before being asked to solve the target task. Compared with 0-shot evaluation, this gives the model more context about the expected answer and the task format, which can improve performance, and it orients the model toward the specific domain, which is especially useful in specialized fields. Self-reported
73.0%
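The 3-shot setup described above amounts to simple prompt assembly: three (task, solution) pairs followed by the target task. A minimal sketch; the example tasks and the "Task:/Solution:" template are hypothetical placeholders, not the actual MBPP items or prompt format:

```python
# Sketch of 3-shot prompt assembly: three worked (task, solution) examples
# precede the target task, so the model sees the expected answer format.
# Example tasks and template wording are hypothetical placeholders.

examples = [
    ("Write a function that returns the square of a number.",
     "def square(n):\n    return n * n"),
    ("Write a function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
    ("Write a function that checks if a number is even.",
     "def is_even(n):\n    return n % 2 == 0"),
]

def build_3shot_prompt(target_task: str) -> str:
    parts = []
    for task, solution in examples:
        parts.append(f"Task: {task}\nSolution:\n{solution}\n")
    parts.append(f"Task: {target_task}\nSolution:\n")
    return "\n".join(parts)

prompt = build_3shot_prompt("Write a function that sums a list of numbers.")
print(prompt.count("Task:"))  # 4 (three examples plus the target)
```

The prompt ends right after "Solution:", so the model's continuation is the candidate answer to be scored.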

Mathematics

Mathematical problems and computations
GSM8k
0-shot evaluation (no examples provided). Self-reported
94.4%
MATH
0-shot evaluation (no preliminary examples). Self-reported
83.8%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation. Self-reported
85.7%
GPQA
Evaluation on the GPQA "diamond" subset: graduate-level multiple-choice science questions with a single, easily verified correct answer. Such tasks make good benchmarks because correctness can be checked objectively, and modern models still frequently fail on them, e.g. advanced mathematics, programming, specialized domain knowledge, and multi-step reasoning. Self-reported
40.9%

Multimodal

Working with images and visual data
AI2D
Multimodal evaluation. Self-reported
84.2%
ChartQA
Multimodal evaluation. Self-reported
75.7%
DocVQA
Multimodal evaluation: the model works with mixed inputs (text plus images). Self-reported
87.1%

Other Tests

Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation: the model must generate the correct answer with no examples or additional guidance. Self-reported
16.3%
Bird-SQL (dev)
Text-to-SQL evaluation on the BIRD-SQL development set. Self-reported
47.9%
ECLeKTic
0-shot evaluation. Self-reported
10.3%
FACTS Grounding
Evaluation of how well model responses stay factually grounded in the provided source material. Self-reported
75.8%
Global-MMLU-Lite
0-shot evaluation. Self-reported
69.5%
HiddenMath
0-shot evaluation. Self-reported
54.5%
IFEval
0-shot evaluation of instruction following. Self-reported
88.9%
InfoVQA
Multimodal evaluation. Self-reported
64.9%
LiveCodeBench
0-shot evaluation. Self-reported
24.6%
MathVista-Mini
Multimodal evaluation. Self-reported
62.9%
MMLU-Pro
0-shot evaluation. Self-reported
60.6%
MMMU (val)
Multimodal evaluation. Self-reported
59.6%
Natural2Code
0-shot evaluation. Self-reported
80.7%
SimpleQA
0-shot evaluation. Self-reported
6.3%
TextVQA
Multimodal evaluation. Self-reported
67.7%
VQAv2 (val)
Multimodal evaluation. Self-reported
71.6%
WMT24++
0-shot evaluation: the model performs the task without any preliminary examples, which shows how well it can apply knowledge from its training when confronted with a new assignment. This setup is especially useful for gauging real-world behavior, where users rarely supply examples with each query. Self-reported
51.6%
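In contrast to the few-shot setup used for MBPP above, a 0-shot prompt for a translation benchmark like this one contains only an instruction and the input, with no worked examples. A minimal sketch; the exact prompt wording used by the benchmark is not specified on this page, so this phrasing is an illustrative assumption:

```python
# Minimal 0-shot prompt: instruction + input, no examples.
# The wording here is an illustrative assumption, not the benchmark's template.

def zero_shot_translation_prompt(text: str, target_lang: str) -> str:
    return f"Translate the following text into {target_lang}:\n\n{text}"

p = zero_shot_translation_prompt("Hello, world!", "German")
print("\n" in p and p.endswith("Hello, world!"))  # True
```

The model's continuation is scored directly against reference translations, with no demonstrations to anchor the output format.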

License & Metadata

License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.