mehti/medgemma-cataract-surgical-analysis

VISIONConcurrency Cost:1Model Size:4.3BQuant:BF16Ctx Length:32kPublished:Feb 24, 2026License:health-ai-developer-foundationsArchitecture:Transformer Cold

The mehti/medgemma-cataract-surgical-analysis model is a fine-tuned version of Google's MedGemma 1.5 4B IT, specialized for expert-level analysis of cataract surgery video frames. It provides a Chain-of-Thought (CoT) reasoning process to interpret surgical phases, instrument-anatomy relationships, and safety margins, followed by clear, actionable instructions for surgical residents. Trained on the Cataract-1K dataset, this model excels at generating structured responses for AI-assisted surgical training and research.

Loading preview...

MedGemma 1.5 4B IT — Cataract Surgical Analysis

This model is a specialized fine-tuned version of Google's MedGemma 1.5 4B IT, specifically designed for the analysis of cataract surgery video frames. It integrates a Chain-of-Thought (CoT) approach to deliver expert-level reasoning and safety instructions, making it distinct from general-purpose models.

Key Capabilities

  • Expert-level Reasoning: Provides a detailed "Thinking Process" that analyzes surgical phases, identifies instrument-anatomy relationships, and assesses safety margins within surgical video frames.
  • Actionable Instructions: Generates a "Final Answer" with clear, concise instructions suitable for a surgical resident.
  • Structured Output: Delivers responses in a consistent format, separating the reasoning trace from the final instruction.
  • Specialized Training: Fine-tuned on the Cataract-1K dataset, a component of the LMOD benchmark, with reasoning traces distilled from Qwen3-VL-30B-A3B-Thinking.

Good for

  • Research in Medical AI: Ideal for exploring multimodal AI capabilities within surgical domains.
  • AI-assisted Surgical Training: Serves as a prototype for educational systems aimed at surgical residents.
  • Interpreting Surgical Video: Excels at providing detailed analysis and guidance based on individual frames from cataract surgery videos.

This model was fine-tuned using LoRA with 4bit-nf4-double_quant quantization and achieved stable learning, with evaluation loss converging to 0.19–0.24 and token accuracy reaching ~0.94 in the best-performing fold.