MedGemma 4B IT

Multimodal
Google

MedGemma is a collection of Gemma 3 variants trained for medical text processing and image understanding. MedGemma 4B uses a SigLIP image encoder that was specifically pre-trained on diverse, de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Its LLM component is trained on a diverse set of medical data, including radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma is a multimodal model that has primarily been evaluated on single-image tasks; it has not been evaluated for multi-turn applications and may be more sensitive to prompt formatting than the base Gemma 3 models. Developers should account for biases in their validation data and for possible data contamination when benchmarking MedGemma.
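The checkpoint is published on Hugging Face as google/medgemma-4b-it. Below is a minimal single-image inference sketch using the transformers image-text-to-text pipeline; the chat-message format follows standard Gemma 3 conventions, and the image URL is a placeholder to be replaced with your own input.

```python
# Minimal sketch: single-image question answering with MedGemma 4B IT.
# Assumes a recent `transformers` release with Gemma 3 support and a CUDA
# device; the image URL below is a placeholder, not a real resource.
import torch
import requests
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

image = Image.open(
    requests.get("https://example.com/chest_xray.png", stream=True).raw
)

# MedGemma has primarily been evaluated on single-image, single-turn
# prompts, so the conversation is one user turn with one image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the findings in this chest X-ray."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])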

Key Specifications

Parameters
4.3B
Context
-
Release Date
May 20, 2025
Average Score
58.5%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
4.3B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Other Tests

Specialized benchmarks
CheXpert CXR
F1 for 5 main conditions (self-reported)
48.1%
DermMCQA
Accuracy (self-reported)
71.8%
MedXpertQA
Accuracy (self-reported)
18.8%
MIMIC CXR
F1 for 5 main conditions (self-reported)
88.9%
PathMCQA
Accuracy (self-reported)
69.8%
SlakeVQA
Tokenized F1 (self-reported); see the sketch after this list
62.3%
VQA-Rad
Tokenized F1 (self-reported)
49.9%
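
SlakeVQA and VQA-Rad are scored with tokenized F1: tokenize the predicted and reference answers, then compute F1 over the tokens they share, so partially correct answers receive partial credit. A minimal sketch of that computation follows; whitespace tokenization after lowercasing is an assumption here, and the reported results may normalize or tokenize differently.

```python
# Minimal sketch of tokenized F1 for visual question answering.
# Assumption: lowercased whitespace tokenization; the official
# evaluation may use a different tokenizer or normalization.
from collections import Counter

def tokenized_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Tokens shared by prediction and reference, counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# All four predicted tokens appear in the six-token reference:
# precision = 1.0, recall = 2/3, F1 = 0.8.
print(tokenized_f1("left lower lobe opacity", "opacity in the left lower lobe"))
```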

License & Metadata

License
Health AI Developer Foundations Terms of Use
Announcement Date
May 20, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or browse the full catalog to see all available models.