unsloth/medgemma-4b-it
Hugging Face
Tags: VISION
Concurrency Cost: 1 · Model Size: 4.3B · Quant: BF16 · Ctx Length: 32k · Published: May 20, 2025 · License: health-ai-developer-foundations · Architecture: Transformer

unsloth/medgemma-4b-it is a 4.3 billion parameter instruction-tuned multimodal language model developed by Google, based on the Gemma 3 architecture. It is specifically trained for medical text and image comprehension, using a SigLIP image encoder pre-trained on diverse de-identified medical data. The model excels at medical applications involving text generation, visual question answering, and report generation from medical images such as X-rays and histopathology slides.


MedGemma 4B Instruction-Tuned Model

MedGemma 4B IT is a 4.3 billion parameter multimodal model from Google, built upon the Gemma 3 architecture and specifically optimized for healthcare AI applications. It integrates a SigLIP image encoder, pre-trained on a wide array of de-identified medical images including chest X-rays, dermatology, ophthalmology, and histopathology, alongside an LLM component trained on diverse medical text and question-answer pairs.

Key Capabilities

  • Multimodal Medical Comprehension: Processes both medical text and images (normalized to 896x896 resolution) to generate text outputs.
  • Specialized Medical Training: Significantly outperforms base Gemma 3 4B on medical image classification, visual question answering, and text-only medical benchmarks.
  • Report Generation: Demonstrates strong performance in generating chest X-ray reports, and can be fine-tuned to improve accuracy against specific ground-truth reports.
  • Long Context Support: Supports a context length of at least 128K tokens for comprehensive input.
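The multimodal comprehension described above is typically driven through the transformers chat template, mixing an image and a text question in one user turn. The sketch below shows one plausible way to do this; the image path and prompt are illustrative assumptions, and generation settings are defaults rather than values from this card.

```python
# Hedged sketch: multimodal inference with unsloth/medgemma-4b-it via
# Hugging Face transformers. The image path and question are placeholders.

MODEL_ID = "unsloth/medgemma-4b-it"


def build_messages(image, question):
    """Assemble a Gemma 3-style chat turn mixing an image and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_inference(image_path, question):
    # Heavy imports kept local so build_messages() stays importable without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    messages = build_messages(Image.open(image_path), question)
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    return processor.decode(
        out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

# Call run_inference("chest_xray.png", "Describe any abnormalities.") to run
# end-to-end; note this downloads ~8 GB of BF16 weights.
```

The processor handles the 896x896 image normalization mentioned above, so callers can pass images at their native resolution.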

Good For

  • Developing Healthcare AI Applications: Serves as an efficient starting point for applications requiring medical text and image understanding.
  • Medical Text Generation: Ideal for tasks involving generating text responses, analyses, or summaries from medical inputs.
  • Visual Question Answering: Excels at answering questions based on medical images across various modalities.
  • Fine-tuning: Designed to be fine-tuned by developers with proprietary data for specific medical tasks, offering strong baseline performance for adaptation.
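The fine-tuning path in the last bullet is commonly approached with parameter-efficient LoRA adapters via the `peft` library. The sketch below is one plausible setup; the rank, alpha, dropout, and target-module list are illustrative assumptions, not recommendations from this card.

```python
# Hedged sketch: parameter-efficient fine-tuning of medgemma-4b-it with LoRA.
# All hyperparameters below are illustrative defaults, not values from the card.


def lora_hyperparams():
    """Return an illustrative LoRA configuration as a plain dict."""
    return {
        "r": 16,                # low-rank adapter dimension
        "lora_alpha": 32,       # scaling factor (alpha / r scales adapter output)
        "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "task_type": "CAUSAL_LM",
    }


def attach_adapters():
    # Heavy imports kept local so lora_hyperparams() stays importable without them.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForImageTextToText

    model = AutoModelForImageTextToText.from_pretrained(
        "unsloth/medgemma-4b-it", torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = get_peft_model(model, LoraConfig(**lora_hyperparams()))
    model.print_trainable_parameters()
    # From here, train on your own de-identified data, e.g. with trl's SFTTrainer.
    return model

# Call attach_adapters() to prepare the model; training data and trainer
# configuration are left to the developer, per the bullet above.
```

Only the small adapter matrices are trained, which keeps memory requirements far below full fine-tuning of the 4.3B parameters.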