MedGemma 4B IT

Multimodal
Google

MedGemma is a collection of Gemma 3 variants trained for medical text processing and image understanding. MedGemma 4B uses a SigLIP image encoder that was specifically pre-trained on diverse, de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Its LLM component is trained on a diverse set of medical data, including radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma is a multimodal model that has primarily been evaluated on single-image tasks; it has not been evaluated for multi-turn applications and may be more sensitive to prompt formatting than the base Gemma 3 models. Developers should account for biases in their validation data and for possible data contamination when benchmarking MedGemma.
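The checkpoint is published on Hugging Face as google/medgemma-4b-it. Below is a minimal single-image inference sketch using the transformers image-text-to-text pipeline; the chat-message format follows standard Gemma 3 conventions, and the image URL is a placeholder to be replaced with your own input.

```python
# Minimal sketch: single-image question answering with MedGemma 4B IT.
# Assumes a recent `transformers` release with Gemma 3 support and a CUDA
# device; the image URL below is a placeholder, not a real resource.
import torch
import requests
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

image = Image.open(
    requests.get("https://example.com/chest_xray.png", stream=True).raw
)

# MedGemma has primarily been evaluated on single-image, single-turn
# prompts, so the conversation is one user turn with one image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the findings in this chest X-ray."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])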

Key Specifications

Parameters
4.3B
Context
-
Release Date
May 20, 2025
Average Score
58.5%

Timeline

Key dates in the model's history
Announcement
May 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
4.3B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Other Tests

Specialized benchmarks
CheXpert CXR
F1 for 5 main conditions (self-reported)
48.1%
DermMCQA
Accuracy (self-reported)
71.8%
MedXpertQA
Accuracy (self-reported)
18.8%
MIMIC CXR
F1 for 5 main conditions (self-reported)
88.9%
PathMCQA
Accuracy (self-reported)
69.8%
SlakeVQA
Tokenized F1 (self-reported); see the sketch after this list
62.3%
VQA-Rad
Tokenized F1 (self-reported)
49.9%
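
SlakeVQA and VQA-Rad are scored with tokenized F1: tokenize the predicted and reference answers, then compute F1 over the tokens they share, so partially correct answers receive partial credit. A minimal sketch of that computation follows; whitespace tokenization after lowercasing is an assumption here, and the reported results may normalize or tokenize differently.

```python
# Minimal sketch of tokenized F1 for visual question answering.
# Assumption: lowercased whitespace tokenization; the official
# evaluation may use a different tokenizer or normalization.
from collections import Counter

def tokenized_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Tokens shared by prediction and reference, counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# All four predicted tokens appear in the six-token reference:
# precision = 1.0, recall = 2/3, F1 = 0.8.
print(tokenized_f1("left lower lobe opacity", "opacity in the left lower lobe"))
```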

License & Metadata

License
Health AI Developer Foundations Terms of Use
Announcement Date
May 20, 2025
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare, or browse the full catalog to see all available models.