Page Summary
- MedGemma is a collection of Gemma 3 variants trained for performance on medical text and image comprehension to accelerate the building of healthcare-based AI applications.
- MedGemma comes in a 4B multimodal version, a 27B text-only version, and a 27B multimodal version, with multimodal versions utilizing a SigLIP image encoder trained on de-identified medical data.
- MedGemma variants have been evaluated on clinically relevant benchmarks for baseline performance, and developers can fine-tune them for improved performance.
- MedGemma is optimized for medical applications involving text generation, while MedSigLIP is recommended for medical image-based applications without text generation.
- The use of MedGemma is governed by the Health AI Developer Foundations terms of use, and the models were trained on a combination of public and licensed de-identified datasets.
Model documentation: MedGemma
Resources:
- Model on Google Cloud Model Garden: MedGemma
- Models on Hugging Face: Collection
- Concept applications built using MedGemma: Collection
- Tutorial notebooks
- Custom container source code for server-side image processing. A prebuilt Docker image, based on a custom architecture with a convenient API, is also available and provides additional deployment options in Model Garden alongside the standard vLLM container.
License: The use of MedGemma is governed by the Health AI Developer Foundations terms of use.
Support channels
Author: Google
Model information
This section describes the specifications and recommended use of the MedGemma model.
Description
MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications.
MedGemma 1.5 4B is an updated version of the MedGemma 1 4B model.
MedGemma 1.5 4B expands support for several new medical imaging and data processing applications, including:
- High-dimensional medical imaging: Interpretation of three-dimensional volume representations of Computed Tomography (CT) and Magnetic Resonance Imaging (MRI).
- Whole-slide histopathology imaging (WSI): Simultaneous interpretation of multiple patches from a whole slide histopathology image as input.
- Longitudinal medical imaging: Interpretation of chest X-rays in the context of prior images (e.g., comparing current versus historical scans).
- Anatomical localization: Bounding box–based localization of anatomical features and findings in chest X-rays.
- Medical document understanding: Extraction of structured data, such as values and units, from unstructured medical lab reports (see the example prompt sketched after this list).
- Electronic Health Record (EHR) understanding: Interpretation of text-based EHR data.
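To make the document-understanding item concrete, here is a minimal sketch of one possible prompt, using the same Transformers pipeline shown in the How to use section below. The file name, prompt wording, and JSON fields are illustrative assumptions, and any extracted values should be validated downstream.

from transformers import pipeline
from PIL import Image
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-1.5-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

# Placeholder file name for a scanned lab report page.
lab_report = Image.open("lab_report_page1.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": lab_report},
            {"type": "text", "text": (
                "Extract every lab test in this report as a JSON list with the fields "
                "test_name, value, unit, and reference_range. Return only JSON."
            )},
        ]
    }
]

output = pipe(text=messages, max_new_tokens=1500)
print(output[0]["generated_text"][-1]["content"])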
In addition to these new features, MedGemma 1.5 4B delivers improved accuracy on medical text reasoning and modest improvement on standard 2D image interpretation compared to MedGemma 1 4B.
MedGemma utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. The LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data, 2D and 3D radiology images, histopathology images, ophthalmology images, dermatology images, and lab reports for document understanding.
MedGemma 1.5 4B has been evaluated on a range of clinically relevant benchmarks to illustrate its baseline performance. These evaluations are based on both open benchmark datasets and internally curated datasets. Developers are expected to fine-tune MedGemma for improved performance on their use case. Consult the Intended use section for more details.
MedGemma is optimized for medical applications that involve a text generation component. For medical image-based applications that do not involve text generation, such as data-efficient classification, zero-shot classification, or content-based or semantic image retrieval, the MedSigLIP image encoder is recommended. MedSigLIP is based on the same image encoder that powers MedGemma 1 and MedGemma 1.5.
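As a rough illustration of the image-only path, the sketch below runs zero-shot classification through the Transformers zero-shot-image-classification pipeline. The model identifier google/medsiglip-448 and the candidate labels are assumptions made for illustration; consult the MedSigLIP model card for the exact identifier and recommended usage.

from transformers import pipeline
from PIL import Image
import requests
import torch

# Hypothetical identifier for illustration; verify against the MedSigLIP model card.
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/medsiglip-448",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

# Image attribution: Stillwaterising, CC0, via Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw)

# Free-text candidate labels are scored against the image; no fine-tuning is required.
results = classifier(image, candidate_labels=["normal chest X-ray", "chest X-ray with pleural effusion"])
print(results)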
How to use
The following are some example code snippets to help you quickly get started running the model locally on GPU.
First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
$ pip install -U transformers
Next, use either the pipeline API or the model classes directly to send a chest X-ray image and a question to the model.
Note that CT, MRI and whole-slide histopathology images require some pre-processing; see the CT and WSI notebook for examples.
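For orientation only, the following sketch shows one simple way a CT volume might be turned into a handful of windowed 2D slices that can be passed as multiple image entries in a chat message. The function name, slice count, and window settings are illustrative assumptions; the CT and WSI notebook shows the preprocessing actually used.

import numpy as np
from PIL import Image

def ct_slices_to_images(volume_hu, num_slices=8, window_center=40, window_width=400):
    # Convert a CT volume in Hounsfield units (shape [depth, height, width]) into a few
    # soft-tissue-windowed 2D slices. Simplified illustration only; see the CT and WSI
    # notebook for the preprocessing used with MedGemma.
    lo, hi = window_center - window_width / 2, window_center + window_width / 2
    indices = np.linspace(0, volume_hu.shape[0] - 1, num_slices).astype(int)
    images = []
    for i in indices:
        sl = np.clip(volume_hu[i], lo, hi)
        sl = ((sl - lo) / (hi - lo) * 255).astype(np.uint8)
        images.append(Image.fromarray(sl).convert("RGB"))
    return images

# Each returned PIL image can then be added as its own {"type": "image", "image": ...}
# entry in the message "content" list, alongside the text prompt.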
Run model with the pipeline API
from transformers import pipeline
from PIL import Image
import requests
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-1.5-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

# Image attribution: Stillwaterising, CC0, via Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this X-ray"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=2000)
print(output[0]["generated_text"][-1]["content"])
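In this snippet, the pipeline returns the full chat transcript for each input, so indexing `[-1]["content"]` on `generated_text` retrieves only the newly generated assistant reply. `max_new_tokens` can be raised for longer outputs, up to the 8192-token output limit listed under Inputs and outputs below.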
Run the model directly
# Make sure to install the accelerate library first via `pip install accelerate`
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests
import torch

model_id = "google/medgemma-1.5-4b-it"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Image attribution: Stillwaterising, CC0, via Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this X-ray"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
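Text-only prompts use the same chat format, simply without an image entry. The sketch below reuses the model and processor loaded above; the question is only an example.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How do you differentiate bacterial from viral pneumonia?"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=500, do_sample=False)

# Decode only the newly generated tokens.
print(processor.decode(generation[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))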
Examples
Refer to the growing collection of tutorial notebooks to see how to use or fine-tune MedGemma.
Model architecture overview
MedGemma is built on Gemma 3 and uses the same decoder-only transformer architecture as Gemma 3. To read more about the architecture, consult the Gemma 3 model card.
Technical specifications
- Model type: Decoder-only Transformer architecture, see the Gemma 3 Technical Report
- Input modalities: Text, vision (multimodal)
- Output modality: Text only
- Attention mechanism: Grouped-query attention (GQA)
- Context length: Supports long context, at least 128K tokens
- Key publication: https://arxiv.org/abs/2507.05201
- Model created: 4B multimodal: Jan 13, 2026
- Model version: 4B multimodal: 1.5.0
Citation
When using this model, please cite: Sellergren et al. "MedGemma Technical Report." arXiv preprint arXiv:2507.05201 (2025).
@article{sellergren2025medgemma,
title={MedGemma Technical Report},
author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, Cían and Lau, Charles and others},
journal={arXiv preprint arXiv:2507.05201},
year={2025}
}
Inputs and outputs
Input:
- Text string, such as a question or prompt
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
- Total input length of 128K tokens (see the rough token-budget sketch after this list)
Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of 8192 tokens
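As a rough, illustrative budget calculation based on the figures above (treating 128K as 128,000 tokens and ignoring chat-template overhead):

TOKENS_PER_IMAGE = 256      # each 896 x 896 image is encoded to 256 tokens
MAX_INPUT_TOKENS = 128_000  # "128K" treated as 128,000 here for simplicity

def remaining_text_budget(num_images: int) -> int:
    # Tokens left for text after accounting for the image tokens.
    return MAX_INPUT_TOKENS - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(1))    # 127744 tokens of text alongside a single image
print(remaining_text_budget(100))  # 102400 tokens alongside 100 images (e.g., CT slices or WSI patches)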
Performance and evaluations
MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks.
Key performance metrics
Imaging evaluations
The multimodal performance of MedGemma 1.5 4B was evaluated across a range of benchmarks, focusing on radiology (2D, longitudinal 2D, and 3D), dermatology, histopathology, ophthalmology, document understanding, and multimodal clinical reasoning. See Data card for details of individual datasets.
We also list the previous results for MedGemma 1 4B and 27B (multimodal models only), as well as for Gemma 3 4B for comparison.
| Task / Dataset | Metric | Gemma 3 4B | MedGemma 1 4B | MedGemma 1.5 4B | MedGemma 1 27B |
|---|---|---|---|---|---|
| 3D radiology image classification | |||||
| CT Dataset 1*(7 conditions/abnormalities) | Macro accuracy | 54.5 | 58.2 | 61.1 | 57.8 |
| CT-RATE (validation, 18 conditions/abnormalities) | Macro F1 | 23.5 | 27.0 | | |
| | Macro precision | 34.5 | 34.2 | | |
| | Macro recall | 34.1 | 42.0 | | |
| MRI Dataset 1*(10 conditions/abnormalities) | Macro accuracy | 51.1 | 51.3 | 64.7 | 57.4 |
| 2D image classification | |||||
| MIMIC CXR** | Macro F1 (top 5 conditions) | 81.2 | 88.9 | 89.5 | 90.0 |
| CheXpert CXR | Macro F1 (top 5 conditions) | 32.6 | 48.1 | 48.2 | 49.9 |
| CXR14 | Macro F1 (3 conditions) | 32.0 | 50.1 | 48.4 | 45.3 |
| PathMCQA* (histopathology) | Accuracy | 37.1 | 69.8 | 70.0 | 71.6 |
| WSI-Path* (whole-slide histopathology) | ROUGE | 2.3 | 2.2 | 49.4 | 4.1 |
| US-DermMCQA* | Accuracy | 52.5 | 71.8 | 73.5 | 71.7 |
| EyePACS* (fundus) | Accuracy | 14.4 | 64.9 | 76.8 | 75.3 |
| Disease Progression Classification (Longitudinal) | |||||
| MS-CXR-T | Macro Accuracy | 59.0 | 61.11 | 65.7 | 50.1 |
| Visual question answering | |||||
| SLAKE (radiology) | Tokenized F1 | 40.2 | 72.3 | 59.7**** | 70.3 |
| | Accuracy (on closed subset) | 62.0 | 87.6 | 82.8 | 85.9 |
| VQA-RAD*** (radiology) | Tokenized F1 | 33.6 | 49.9 | 48.1 | 46.7 |
| | Accuracy (on closed subset) | 42.1 | 69.1 | 70.2 | 67.1 |
| Region of interest detection | |||||
| Chest ImaGenome: Anatomy bounding box detection | Intersection over union | 6.1 | 3.1 | 38.0 | 16.0 |
| Multimodal medical knowledge and reasoning | |||||
| MedXpertQA (text + multimodal questions) | Accuracy | 16.4 | 18.8 | 20.9 | 26.8 |
* Internal datasets. CT Dataset 1 and MRI Dataset 1 are described below; for evaluation, perfectly balanced samples were drawn per condition. US-DermMCQA is described in Liu et al. (2020, Nature Medicine) and is presented as a 4-way MCQ per example for skin condition classification. PathMCQA is based on multiple datasets, presented as 3-9 way MCQs per example for identification, grading, and subtyping of breast, cervical, and prostate cancer. WSI-Path is a dataset of de-identified H&E WSIs and associated final diagnosis text from original pathology reports, comprising single-WSI examples and previously described in Ahmed et al. (2024, arXiv). EyePACS is a dataset of fundus images with classification labels based on 5-level diabetic retinopathy severity (None, Mild, Moderate, Severe, Proliferative). A subset of these datasets is described in more detail in the MedGemma Technical Report.
** Based on radiologist adjudicated labels, described in Yang (2024, arXiv) Section A.1.1.
*** Based on "balanced split," described in Yang (2024, arXiv).
**** While MedGemma 1.5 4B exhibits strong radiology interpretation capabilities, it was less optimized for the SLAKE Q&A format compared to MedGemma 1 4B. Fine-tuning on SLAKE may improve results.
Chest X-ray report generation
MedGemma chest X-ray (CXR) report generation performance was evaluated on MIMIC-CXR using the RadGraph F1 metric. We compare MedGemma 1.5 4B against a fine-tuned version of MedGemma 1 4B, and the MedGemma 1 27B base model.
| Task / Dataset | Metric | MedGemma 1 4B (tuned for CXR) | MedGemma 1.5 4B | MedGemma 1 27B |
|---|---|---|---|---|
| Chest X-ray report generation | ||||
| MIMIC CXR | RadGraph F1 | 30.3 | 27.2 | 27.0 |
Text evaluations
MedGemma 1.5 4B was evaluated across a range of text-only benchmarks for medical knowledge and reasoning. Existing results for MedGemma 1 variants and Gemma 3 are shown for comparison.
| Dataset | Gemma 3 4B | MedGemma 1 4B | MedGemma 1.5 4B | MedGemma 1 27B |
|---|---|---|---|---|
| MedQA (4-op) | 50.7 | 64.4 | 69.1 | 85.3 |
| MedMCQA | 45.4 | 55.7 | 59.8 | 70.2 |
| PubMedQA | 68.4 | 73.4 | 68.2 | 77.2 |
| MMLU Med | 67.2 | 70.0 | 69.6 | 86.2 |
| MedXpertQA (text only) | 11.6 | 14.2 | 16.4 | 23.7 |
| AfriMed-QA (25 question test set) | 48.0 | 52.0 | 56.0 | 72.0 |
Medical record evaluations
EHR understanding and interpretation were evaluated on question-answering benchmark datasets built from synthetic longitudinal text-based EHR data and real-world de-identified discharge summaries, comparing MedGemma 1.5 4B, the MedGemma 1 variants, and Gemma 3 4B.
| Dataset | Metric | Gemma 3 4B | MedGemma 1 4B | MedGemma 1.5 4B | MedGemma 1 27B |
|---|---|---|---|---|---|
| EHRQA* | Accuracy | 70.9 | 67.6 | 89.6 | 90.5 |
| EHRNoteQA | Accuracy | 78.0 | 79.4 | 80.4 | 90.7 |
* Internal dataset
Document understanding evaluations
Evaluation of converting unstructured medical lab report documents (PDFs/images) into structured JSON data.
| Task / Dataset | Metric | Gemma 3 4B | MedGemma 1 4B | MedGemma 1.5 4B | MedGemma 1 27B |
|---|---|---|---|---|---|
| PDF-to-JSON Lab Test Data Conversion | |||||
| EHR Dataset 2* (raw PDF to JSON) | Macro F1 (average over per document F1 scores) | 84.0 | 78.0 | 91.0 | 76.0 |
| | Micro F1 (F1 across all extracted data fields) | 81.0 | 75.0 | 88.0 | 70.0 |
| EHR Dataset 3* (raw PDF to JSON) | Macro F1 | 61.0 | 50.0 | 71.0 | 66.0 |
| | Micro F1 | 61.0 | 51.0 | 70.0 | 69.0 |
| Mendeley Clinical Laboratory Test Reports (PNG image to JSON) | Macro F1 | 83.0 | 85.0 | 85.0 | 69.0 |
| | Micro F1 | 78.0 | 81.0 | 83.0 | 68.0 |
| EHR Dataset 4* | Macro F1 | 41.0 | 25.0 | 64.0 | |
| | Micro F1 | 41.0 | 33.0 | 67.0 | |
* Internal datasets.
Ethics and safety evaluation
Evaluation approach
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
- Child safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.
- General medical harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and potentially harmful responses or inaccuracies.
In addition to development level evaluations, we conduct "assurance evaluations" which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team and inform decision making about release. High-level findings are fed back to the model team but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
Evaluation results
For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms compared to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts.
Data card
Dataset overview
Training
The base Gemma models are pre-trained on a large corpus of text and code data. MedGemma multimodal variants utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Their LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images.
Evaluation
MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks across multiple datasets, tasks and modalities. These benchmarks include both open and internal datasets.
Source
MedGemma utilizes a combination of public and private datasets.
This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), Chest ImaGenome (bounding boxes linking image findings with anatomical regions for MIMIC-CXR), SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays).
Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next).
Data ownership and documentation
- MIMIC-CXR: MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC).
- MS-CXR-T: Microsoft Research Health Futures, Microsoft Research.
- ChestX-ray14: National Institutes of Health - Clinical Center.
- SLAKE: The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital.
- PAD-UFES-20: Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD).
- SCIN: A collaboration between Google Health and Stanford Medicine.
- TCGA (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC)
- CAMELYON: The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands.
- PMC-OA (PubMed Central Open Access Subset): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH.
- MedQA: This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits.
- MedMCQA: This dataset was created by Ankit Pal, Logesh Kumar Umapathi and Malaikannan Sankarasubbu from Saama AI Research, Chennai, India
- PubMedQA: This dataset was created by Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu from the University of Pittsburgh, Carnegie Mellon University, and Google.
- LiveQA: This dataset was created by Asma Ben Abacha, Eugene Agichtein, Yuval Pinter, and Dina Demner-Fushman from the U.S. National Library of Medicine, Emory University, and the Georgia Institute of Technology.
- Mendeley Digital Knee X-Ray: This dataset is from Rani Channamma University, and is hosted on Mendeley Data.
- AfriMed-QA: This dataset was developed by multiple collaborating organizations and researchers, with key contributors including Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP.
- VQA-RAD: This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health)
- Chest ImaGenome: IBM Research.
- MedExpQA: This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence).
- MedXpertQA: This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China).
- HealthSearchQA: This dataset consists of 3,173 commonly searched consumer questions.
- ISIC: International Skin Imaging Collaboration is a joint effort involving clinicians, researchers, and engineers from various institutions worldwide.
- Mendeley Clinical Laboratory Test Reports: This dataset is hosted on Mendeley and includes 260 clinical laboratory test reports issued by 24 laboratories in Egypt.
- CT-RATE: Istanbul Medipol University Mega Hospital and University of Zurich / ETH Zurich.
In addition to the public datasets listed above, MedGemma was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants.
- CT dataset 1: De-identified dataset of different axial CT studies across body parts (head, chest, abdomen) from a US-based radiology outpatient diagnostic center network.
- MRI dataset 1: De-identified dataset of different axial multi-parametric MRI studies across body parts (head, abdomen, knee) from a US-based radiology outpatient diagnostic center network.
- Ophthalmology dataset 1 (EyePACS): De-identified dataset of fundus images from diabetic retinopathy screening.
- Dermatology dataset 1: De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia.
- Dermatology dataset 2: De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia.
- Dermatology dataset 3: De-identified dataset of non-diseased skin images from an internal data collection effort.
- Dermatology dataset 4: De-identified dataset featuring multiple images and longitudinal visits and records from Japan.
- Dermatology dataset 5: Dermatology dataset featuring unlabeled images.
- Dermatology dataset 6: De-identified cases from adult patients with data representing Fitzpatrick 5 or 6 skin types.
- Pathology dataset 1: De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph node specimens.
- Pathology dataset 2: De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States.
- Pathology dataset 3: De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States.
- Pathology dataset 4: De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E.
- EHR dataset 1: Question/answer dataset drawn from synthetic FHIR records created by Synthea. The test set includes 19 unique patients with 200 questions per patient divided into 10 different categories.
- EHR dataset 2: De-identified Lab Reports across different departments in Pathology, such as Biochemistry, Clinical Pathology, Hematology, Microbiology, and Serology.
- EHR dataset 3: De-identified Lab Reports across different departments in Pathology, such as Biochemistry, Clinical Pathology, Hematology, Microbiology, and Serology, from at least 25 different labs.
- EHR dataset 4: Synthetic dataset of laboratory reports
- EHR dataset 5: Synthetic dataset of approximately 60,000 health-relevant user queries
Data citation
- MIMIC-CXR: Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://physionet.org/content/mimic-cxr/2.1.0/ and Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." Scientific Data 6 (1): 1–8.
- MS-CXR-T: Bannur, S., Hyland, S., Liu, Q., Pérez-García, F., Ilse, M., Coelho de Castro, D., Boecking, B., Sharma, H., Bouzid, K., Schwaighofer, A., Wetscherek, M. T., Richardson, H., Naumann, T., Alvarez Valle, J., & Oktay, O. (2023). MS-CXR-T: Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing (version 1.0.0). PhysioNet. https://doi.org/10.13026/pg10-j984.
- ChestX-ray14: Wang, Xiaosong, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097-2106. 2017.
- SLAKE: Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." http://arxiv.org/abs/2102.09542.
- PAD-UFES-20: Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." Data in brief 32 (2020): 106221.
- SCIN: Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." JAMA Network Open 7 (11): e2446615–e2446615.
- TCGA: The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
- CAMELYON16: Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." JAMA 318 (22): 2199–2210.
- CAMELYON17: Bandi, Peter, et al. "From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge." IEEE transactions on medical imaging 38.2 (2018): 550-560.
- Mendeley Digital Knee X-Ray: Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1
- VQA-RAD: Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images." Scientific Data 5 (1): 1–10.
- Chest ImaGenome: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/wv01-y230
- MedQA: Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." http://arxiv.org/abs/2009.13081.
- MedMCQA: Pal, Ankit, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. "Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering." Conference on health, inference, and learning. PMLR, 2022.
- PubMedQA: Jin, Qiao, et al. "Pubmedqa: A dataset for biomedical research question answering." Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 2019.
- LiveQA: Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.
- AfriMed-QA: Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." http://arxiv.org/abs/2411.15640.
- MedExpQA: Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. arXiv preprint arXiv:2404.05590. Retrieved from https://arxiv.org/abs/2404.05590
- MedXpertQA: Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. "MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." http://arxiv.org/abs/2501.18362.
- HealthSearchQA: Singhal, Karan, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales et al. "Large language models encode clinical knowledge." Nature 620, no. 7972 (2023): 172-180.
- ISIC: Gutman, David; Codella, Noel C. F.; Celebi, Emre; Helba, Brian; Marchetti, Michael; Mishra, Nabin; Halpern, Allan. "Skin Lesion Analysis toward Melanoma Detection: A Challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC)". eprint arXiv:1605.01397. 2016
- Mendeley Clinical Laboratory Test Reports: Abdelmaksoud, Esraa; Gadallah, Ahmed; Asad, Ahmed (2022), “Clinical Laboratory Test Reports”, Mendeley Data, V2, doi: 10.17632/bygfmk4rx9.2
- CheXpert: Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., & Ng, A. Y. (2019). CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv:1901.07031
- CT-RATE: Hamamci, I. E., Er, S., Almas, F., Simsek, A. G., Esirgun, S. N., Dogan, I., Dasdelen, M. F., Wittmann, B., Menze, B., et al. (2024). CT-RATE Dataset. Hugging Face. https://huggingface.co/datasets/ibrahimhamamci/CT-RATE and Hamamci, Ibrahim Ethem, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, et al. 2024. "Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography." arXiv preprint arXiv:2403.17834. https://arxiv.org/abs/2403.17834
- EHRNoteQA: Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Seunghyun Won, Edward Choi. (2024) “EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries.” arXiv:2402.16040
De-identification/anonymization:
Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy.
Implementation information
Details about the model internals.
Software
Training was done using JAX.
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
Use and limitations
Intended use
MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting, and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions.
MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in medical contexts (image and textual); however, the model has been trained using chest X-ray, histopathology, dermatology, and fundus images, CT and MRI, medical text and documents, and electronic health record (EHR) data. Examples of tasks within MedGemma's training include visual question answering on medical images such as radiographs, document understanding, and answering textual medical questions.
Benefits
- Provides strong baseline medical image and text comprehension for models of its size.
- This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training.
- This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics.
Limitations
MedGemma is not intended to be used without appropriate validation, adaptation, and/or making meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies.
MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images.
MedGemma has not been evaluated or optimized for multi-turn applications.
MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3.
When adapting MedGemma, developers should consider the following:
- Bias in validation data: As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc).
- Data contamination concerns: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.
Release notes
MedGemma 4B IT
- May 20, 2025: Initial release
- July 9, 2025: Bug fix: fixed a subtle degradation in multimodal performance. The issue was due to a missing end-of-image token in the model vocabulary, impacting combined text-and-image tasks. This fix reinstates and correctly maps that token, ensuring text-only tasks remain unaffected while restoring multimodal performance.
- Jan 13, 2026: Updated to version 1.5 with improved medical reasoning, medical record interpretation, and medical image interpretation.