Gemma 3 12B
Multimodal
Gemma 3 12B is a vision-language model from Google with 12 billion parameters that processes text and visual input and generates text output. The model has a 128K-token context window, multi-language support, and open weights. It suits question answering, summarization, reasoning, and image-understanding tasks.
Key Specifications
Parameters
12.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
62.5%
Timeline
Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
12.0B
Training Tokens
12.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal · ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
131.1K
Max Output Tokens
131.1K
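As a quick illustration of the rates above, here is a minimal sketch of per-request cost arithmetic. Prices are taken from this card; actual provider pricing may differ.

```python
# Estimate request cost from the rates listed above.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens (from this card)
OUTPUT_PRICE_PER_M = 0.10  # USD per 1M output tokens (from this card)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A 10K-token prompt with a 2K-token completion:
# 0.01 * $0.05 + 0.002 * $0.10 = $0.0007
print(f"${estimate_cost(10_000, 2_000):.4f}")
```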
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
HumanEval
0-shot evaluation • Self-reported
MBPP
3-shot evaluation: the model is shown three worked examples before being asked to solve the target task. Each example pairs a task of the same kind as the target with its correct solution, so the model sees both the expected answer format and the task domain. This can improve performance over 0-shot evaluation, especially in specialized fields, since it gives the model more context about what kind of answer is expected. • Self-reported
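For illustration, a minimal sketch of how such a 3-shot prompt can be assembled. The toy examples and the prompt layout are assumptions, not the actual evaluation harness.

```python
# Minimal 3-shot prompt assembly: three worked examples precede
# the target question. The examples here are toy stand-ins.
EXAMPLES = [
    ("Question: 2 + 2 = ?", "Answer: 4"),
    ("Question: 5 * 3 = ?", "Answer: 15"),
    ("Question: 10 - 7 = ?", "Answer: 3"),
]

def build_few_shot_prompt(target_question: str) -> str:
    """Prepend the worked examples, then pose the target question."""
    shots = "\n\n".join(f"{q}\n{a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQuestion: {target_question}\nAnswer:"

print(build_few_shot_prompt("7 + 6 = ?"))
# The model sees the task format and three correct answers before
# producing its own completion; a 0-shot prompt would omit EXAMPLES.
```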
Mathematics
Mathematical problems and computations
GSM8k
0-shot evaluation • Self-reported
MATH
0-shot evaluation
Evaluation without prior examples • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation • Self-reported
GPQA
Evaluation on the "diamond" subset. These tasks require a definite answer that is easy to verify, which makes them a reliable yardstick: they say something concrete about model performance, are hard to game with prompting, and are cheap to score. Examples include math problems with numeric answers, multiple-choice reasoning questions, and short-answer factual questions. Probing LLM capability should focus on such "diamond" tasks, where modern models still often fail: hard mathematics, programming, domain knowledge, and reasoning. • Self-reported
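A minimal sketch of the exact-match scoring that such easily verified tasks allow. The answer-letter parser is a deliberate simplification, not the official GPQA grader.

```python
# Exact-match scoring for multiple-choice questions: pull the
# predicted answer letter from the completion and compare it to
# the gold label.
import re

def extract_choice(completion: str):
    """Return the first standalone answer letter A-D, or None."""
    match = re.search(r"\b([A-D])\b", completion)
    return match.group(1) if match else None

def accuracy(completions, gold):
    hits = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

print(accuracy(["The answer is B.", "C", "I think (A)."], ["B", "C", "D"]))
# -> 0.667: the third prediction (A) misses the gold answer (D).
```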
Multimodal
Working with images and visual data
AI2D
Multimodal evaluation • Self-reported
ChartQA
Multimodal evaluation • Self-reported
DocVQA
Multimodal evaluation • Self-reported
Other Tests
Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation • Self-reported
Bird-SQL (dev)
Text-to-SQL evaluation on the BIRD dev set • Self-reported
ECLeKTic
0-shot evaluation • Self-reported
FACTS Grounding
Factual-grounding evaluation: responses are scored on whether they are supported by the provided source documents • Self-reported
Global-MMLU-Lite
0-shot evaluation • Self-reported
HiddenMath
0-shot evaluation • Self-reported
IFEval
0-shot evaluation • Self-reported
InfoVQA
Multimodal evaluation • Self-reported
LiveCodeBench
0-shot evaluation • Self-reported
MathVista-Mini
Multimodal evaluation • Self-reported
MMLU-Pro
0-shot evaluation • Self-reported
MMMU (val)
Multimodal evaluation • Self-reported
Natural2Code
0-shot evaluation • Self-reported
SimpleQA
0-shot evaluation • Self-reported
TextVQA
Multimodal evaluation • Self-reported
VQAv2 (val)
Multimodal evaluation • Self-reported
WMT24++
0-shot evaluation: the model performs the task without any prior examples of that specific task. This shows how well the model can apply knowledge from its training when faced with new assignments, and it reflects real-world use, where users rarely supply examples with every query. • Self-reported
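A minimal sketch of a 0-shot prompt in this spirit, using translation as the task (WMT24++ is a machine-translation benchmark). The instruction wording is an assumption, not the actual harness.

```python
# A 0-shot prompt for machine translation: an instruction and the
# source text, with no worked examples.
def build_zero_shot_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    return (
        f"Translate the following {src_lang} text into {tgt_lang}. "
        f"Reply with the translation only.\n\n{source}"
    )

print(build_zero_shot_prompt("Das Wetter ist heute schön.", "German", "English"))
# The model must rely entirely on what it learned in training;
# contrast with the 3-shot sketch under MBPP above.
```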
License & Metadata
License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025
Similar Models
Gemma 3 27B
MM · 27.0B
Best score: 0.9 (HumanEval)
Released: Mar 2025
Price: $0.11/1M tokens
Gemma 3n E4B
MM · 8.0B
Best score: 0.6 (ARC)
Released: Jun 2025
Gemini 1.5 Flash
MM
Best score: 0.8 (MMLU)
Released: May 2024
Price: $0.15/1M tokens
Gemini 2.0 Flash
MM
Best score: 0.6 (GPQA)
Released: Dec 2024
Price: $0.10/1M tokens
Gemma 2 27B
27.2B
Best score: 0.8 (MMLU)
Released: Jun 2024
Llama 3.2 90B Instruct
Meta
MM · 90.0B
Best score: 0.9 (MMLU)
Released: Sep 2024
Price: $1.20/1M tokens
Mistral Small 3.1 24B Base
Mistral AI
MM · 24.0B
Best score: 0.8 (MMLU)
Released: Mar 2025
Price: $0.10/1M tokens
Gemma 3n E2B Instructed LiteRT (Preview)
MM · 1.9B
Best score: 0.7 (HumanEval)
Released: May 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.