google/medgemma-4b-it

Parameters: 4.3B
Precision: BF16
Context length: 32,768 tokens
Released: May 19, 2025
License: other
Hosted on: Hugging Face (gated)
Modality: text + vision
Overview

MedGemma-4b-it: Specialized Multimodal AI for Healthcare

MedGemma-4b-it is a 4.3 billion parameter instruction-tuned model from Google, built on the Gemma 3 architecture and optimized for healthcare AI applications. It integrates a SigLIP image encoder pre-trained on a wide array of de-identified medical images, including chest X-rays, dermatology photographs, ophthalmology images, and histopathology slides. The model's language component is trained on diverse medical text and question-answer pairs, enabling robust comprehension and generation in clinical contexts.
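For orientation, instruction-tuned Gemma-family multimodal models are typically driven through the Hugging Face transformers "image-text-to-text" pipeline with chat-style messages. The sketch below is illustrative, not an excerpt from the official model card: the system/user prompt text and the image URL are placeholders, and because the model is gated, the actual load-and-generate step only runs when you opt in via a hypothetical RUN_MEDGEMMA_DEMO environment variable.

```python
# Illustrative sketch: chat-style multimodal prompting for MedGemma-4b-it via
# the transformers "image-text-to-text" pipeline. The model is gated on
# Hugging Face, so the load/generate step below only runs when opted in.
import os

# Messages mixing text and an image reference; the prompt wording and the
# X-ray URL are placeholders, not values from the model card.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are an expert radiologist."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chest X-ray."},
            {"type": "image", "url": "https://example.com/chest_xray.png"},
        ],
    },
]

if os.environ.get("RUN_MEDGEMMA_DEMO"):  # needs gated-model access and a GPU
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "image-text-to-text",
        model="google/medgemma-4b-it",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    out = pipe(text=messages, max_new_tokens=200)
    print(out[0]["generated_text"][-1]["content"])
else:
    # Without access, just show the prompt structure.
    print(len(messages), messages[1]["content"][0]["type"])
```

The two-part `content` list is the key pattern: each user turn can interleave any number of text and image entries, and the processor handles image normalization internally.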

Key Capabilities

  • Multimodal Medical Comprehension: Processes both medical text and images, including radiology, dermatology, and pathology. Images are normalized to 896x896 resolution and encoded to 256 tokens.
  • Enhanced Medical Performance: Significantly outperforms the base Gemma 3 4B model across various medical benchmarks, including medical image classification (e.g., MIMIC CXR macro F1 88.9), visual question answering (e.g., SLAKE Tokenized F1 72.3), and text-only medical reasoning (e.g., MedQA 64.4).
  • Text Generation for Healthcare: Optimized for applications requiring text generation, such as chest X-ray report generation, achieving a RadGraph F1 of 30.3 when fine-tuned for CXR reporting.
  • Long Context Support: Features a 128K-token context window, allowing for extensive input processing.
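The fixed image-encoding figures above make input budgeting simple arithmetic: every image costs 256 tokens regardless of its original resolution. A minimal sketch, assuming only the numbers stated in this card (the helper function is illustrative, not part of any official API):

```python
# Back-of-the-envelope input budgeting for MedGemma-4b-it, using the figures
# stated above: each image is normalized to 896x896 and always encodes to a
# fixed 256 tokens. This helper is illustrative, not an official API.
TOKENS_PER_IMAGE = 256

def input_token_estimate(num_images: int, prompt_tokens: int) -> int:
    """Estimate total input tokens for a multimodal prompt."""
    return num_images * TOKENS_PER_IMAGE + prompt_tokens

# Example: a two-view chest X-ray study plus a ~300-token instruction.
print(input_token_estimate(num_images=2, prompt_tokens=300))  # 812
```

Because the per-image cost is flat, even multi-image studies consume only a small slice of the context window, leaving ample room for clinical history and generated output.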

Good for

  • Developers building healthcare AI applications that require both text and image understanding.
  • Tasks such as medical visual question answering, medical report generation, and medical text analysis.
  • Fine-tuning for specific clinical use cases using proprietary data to achieve improved performance.
  • Applications where strong baseline medical image and text comprehension is crucial at this model size.
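For the fine-tuning path mentioned above, parameter-efficient methods such as LoRA are a common starting point at this model size. The hyperparameters below are a hypothetical baseline, not recommendations from the model card, and the target-module names are an assumption about the attention projection layers; with the `peft` library installed they would map directly onto `LoraConfig(**lora_hparams)`.

```python
# Hypothetical LoRA fine-tuning hyperparameters for MedGemma-4b-it. These
# values are illustrative defaults, not recommendations from the model card.
lora_hparams = {
    "r": 16,                 # low-rank adapter dimension
    "lora_alpha": 32,        # adapter scaling numerator
    "lora_dropout": 0.05,    # regularization on the adapter path
    # Assumed attention projection names; verify against the actual modules.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

# The effective scaling applied to each adapted weight is lora_alpha / r.
print(lora_hparams["lora_alpha"] / lora_hparams["r"])  # 2.0
```

Raising `r` increases adapter capacity at the cost of more trainable parameters; the alpha/r ratio controls how strongly the adapters perturb the frozen base weights.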