
Gemma 3 12B

Multimodal
Google

Gemma 3 12B is a vision-language model from Google with 12 billion parameters. It processes text and visual input and generates text output. The model offers a 128K-token context window, multi-language support, and open weights, and is suited to question answering, summarization, reasoning, and image-understanding tasks.

Key Specifications

Parameters
12.0B
Context
131.1K
Release Date
March 12, 2025
Average Score
62.5%

Timeline

Key dates in the model's history
Announcement
March 12, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
12.0B
Training Tokens
12.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.05
Output (per 1M tokens)
$0.10
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
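The per-token prices above translate directly into per-request costs. A minimal sketch, assuming the listed rates of $0.05 per 1M input tokens and $0.10 per 1M output tokens (function name and example token counts are illustrative, not from the page):

```python
# Cost estimator using the per-1M-token prices listed above.
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 0.10  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 100K-token prompt with a 2K-token completion.
print(round(request_cost(100_000, 2_000), 6))  # 0.0052
```

Even a near-maximal 131.1K-token prompt costs well under a cent at these rates, which is typical for small open-weight models served by third parties.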

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
0-shot evaluation: the model generates a Python function from a prompt with no worked examples, and solutions are checked for functional correctness. Self-reported
85.4%
MBPP
3-shot evaluation: the model is shown three worked examples, each a task paired with its correct solution, before being asked to solve the target task. Compared with 0-shot evaluation, this gives the model more context about the expected answer and the task format, which can improve performance, and it orients the model toward the specific domain, which is especially useful in specialized fields. Self-reported
73.0%
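The 3-shot setup described above amounts to simple prompt assembly: three (task, solution) pairs followed by the target task. A minimal sketch; the example tasks and the "Task:/Solution:" template are hypothetical placeholders, not the actual MBPP items or prompt format:

```python
# Sketch of 3-shot prompt assembly: three worked (task, solution) examples
# precede the target task, so the model sees the expected answer format.
# Example tasks and template wording are hypothetical placeholders.

examples = [
    ("Write a function that returns the square of a number.",
     "def square(n):\n    return n * n"),
    ("Write a function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
    ("Write a function that checks if a number is even.",
     "def is_even(n):\n    return n % 2 == 0"),
]

def build_3shot_prompt(target_task: str) -> str:
    parts = []
    for task, solution in examples:
        parts.append(f"Task: {task}\nSolution:\n{solution}\n")
    parts.append(f"Task: {target_task}\nSolution:\n")
    return "\n".join(parts)

prompt = build_3shot_prompt("Write a function that sums a list of numbers.")
print(prompt.count("Task:"))  # 4 (three examples plus the target)
```

The prompt ends right after "Solution:", so the model's continuation is the candidate answer to be scored.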

Mathematics

Mathematical problems and computations
GSM8k
0-shot evaluation (no examples provided). Self-reported
94.4%
MATH
0-shot evaluation (no preliminary examples). Self-reported
83.8%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
0-shot evaluation. Self-reported
85.7%
GPQA
Evaluation on the GPQA "diamond" subset: graduate-level multiple-choice science questions with a single, easily verified correct answer. Such tasks make good benchmarks because correctness can be checked objectively, and modern models still frequently fail on them, e.g. advanced mathematics, programming, specialized domain knowledge, and multi-step reasoning. Self-reported
40.9%

Multimodal

Working with images and visual data
AI2D
Multimodal evaluation. Self-reported
84.2%
ChartQA
Multimodal evaluation. Self-reported
75.7%
DocVQA
Multimodal evaluation: the model works with mixed inputs (text plus images). Self-reported
87.1%

Other Tests

Specialized benchmarks
BIG-Bench Extra Hard
0-shot evaluation: the model must generate the correct answer with no examples or additional guidance. Self-reported
16.3%
Bird-SQL (dev)
Text-to-SQL evaluation on the BIRD-SQL development set. Self-reported
47.9%
ECLeKTic
0-shot evaluation. Self-reported
10.3%
FACTS Grounding
Evaluation of how well model responses stay factually grounded in the provided source material. Self-reported
75.8%
Global-MMLU-Lite
0-shot evaluation. Self-reported
69.5%
HiddenMath
0-shot evaluation. Self-reported
54.5%
IFEval
0-shot evaluation of instruction following. Self-reported
88.9%
InfoVQA
Multimodal evaluation. Self-reported
64.9%
LiveCodeBench
0-shot evaluation. Self-reported
24.6%
MathVista-Mini
Multimodal evaluation. Self-reported
62.9%
MMLU-Pro
0-shot evaluation. Self-reported
60.6%
MMMU (val)
Multimodal evaluation. Self-reported
59.6%
Natural2Code
0-shot evaluation. Self-reported
80.7%
SimpleQA
0-shot evaluation. Self-reported
6.3%
TextVQA
Multimodal evaluation. Self-reported
67.7%
VQAv2 (val)
Multimodal evaluation. Self-reported
71.6%
WMT24++
0-shot evaluation: the model performs the task without any preliminary examples, which shows how well it can apply knowledge from its training when confronted with a new assignment. This setup is especially useful for gauging real-world behavior, where users rarely supply examples with each query. Self-reported
51.6%
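In contrast to the few-shot setup used for MBPP above, a 0-shot prompt for a translation benchmark like this one contains only an instruction and the input, with no worked examples. A minimal sketch; the exact prompt wording used by the benchmark is not specified on this page, so this phrasing is an illustrative assumption:

```python
# Minimal 0-shot prompt: instruction + input, no examples.
# The wording here is an illustrative assumption, not the benchmark's template.

def zero_shot_translation_prompt(text: str, target_lang: str) -> str:
    return f"Translate the following text into {target_lang}:\n\n{text}"

p = zero_shot_translation_prompt("Hello, world!", "German")
print("\n" in p and p.endswith("Hello, world!"))  # True
```

The model's continuation is scored directly against reference translations, with no demonstrations to anchor the output format.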

License & Metadata

License
gemma
Announcement Date
March 12, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.