Model Overview

This model, developed by Matthew Chung, is a 3.1-billion-parameter transformer language model fine-tuned from Qwen/Qwen2.5-3B-Instruct. It was trained with Group Relative Policy Optimization (GRPO) to specialize in medical reasoning tasks, using the Unsloth library for efficient training.
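In GRPO, each prompt gets a group of sampled completions. Instead of a learned critic, each completion is scored relative to the others in its group: its reward minus the group mean, divided by the group standard deviation. The plain-Python sketch below shows only that advantage step. The function name, the use of the population standard deviation, and the `eps` value are illustrative assumptions, not details taken from this model's training code.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Turn the rewards of one group of completions (same prompt) into advantages.

    Assumption for this sketch: population standard deviation, plus a small eps
    so a group where every completion gets the same reward does not divide by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions sampled for one prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Completions that beat their group average get a positive advantage and are reinforced; those below it are pushed down. A group with identical rewards yields all-zero advantages and therefore contributes no learning signal.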

Key Capabilities

Medical Reasoning: Optimized to understand and answer medical reasoning questions; trained on the FreedomIntelligence/medical-o1-reasoning-SFT dataset.
GRPO Fine-tuning: Uses custom reward functions that score semantic correctness and perplexity, intended to improve performance in the medical domain.
Educational Use: Primarily intended for educational purposes in medical contexts.
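The card does not publish the reward functions themselves, so the following is only a hypothetical sketch of what "semantic correctness" and "perplexity" rewards can look like. Every name and weight here is an assumption. The semantic reward uses bag-of-words cosine similarity as a dependency-free stand-in for an embedding-based comparison, and the perplexity reward maps per-token log-probabilities of a response to 1 / perplexity.

```python
import math
from collections import Counter
from typing import List


def semantic_reward(response: str, reference: str) -> float:
    """Hypothetical stand-in: cosine similarity of word counts (0.0 to 1.0).

    A real semantic-correctness reward would more likely compare sentence embeddings.
    """
    a = Counter(response.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_reward(token_logprobs: List[float]) -> float:
    """Hypothetical mapping: 1 / perplexity, i.e. the geometric-mean token probability."""
    return 1.0 / perplexity(token_logprobs)


def combined_reward(response: str, reference: str, token_logprobs: List[float],
                    w_sem: float = 0.7, w_ppl: float = 0.3) -> float:
    """Hypothetical weighted sum of the two rewards; the weights are illustrative only."""
    return w_sem * semantic_reward(response, reference) + w_ppl * perplexity_reward(token_logprobs)


score = combined_reward(
    "start empiric antibiotics",
    "start empiric antibiotics promptly",
    [math.log(0.5)] * 3,  # every token predicted with probability 0.5
)
print(round(score, 3))
```

Mapping perplexity to 1 / perplexity keeps the reward in (0, 1], on the same scale as the cosine similarity, so the two terms can be mixed with simple weights.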

Training Details

The model was trained on 25,117 samples from the FreedomIntelligence/medical-o1-reasoning-SFT dataset over approximately 14 hours on an NVIDIA RTX 3090. Training used a learning rate of 5e-6, a batch size of 1, and 4-bit quantization. The run ended with a final loss of 0.001300 and a semantic score of 0.630995.
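The figures above allow a rough back-of-envelope check of throughput and memory. This sketch assumes each sample was processed once and counts only the quantized weights: quantization scales, any adapter weights, activations, optimizer state, and the sampled completions GRPO generates per prompt are all ignored.

```python
# Figures from the training details above.
NUM_SAMPLES = 25_117
TRAIN_HOURS = 14          # "approximately 14 hours"
NUM_PARAMS = 3.1e9        # 3.1 billion parameters
BITS_PER_WEIGHT = 4       # 4-bit quantization

seconds = TRAIN_HOURS * 3600
samples_per_second = NUM_SAMPLES / seconds
seconds_per_sample = seconds / NUM_SAMPLES

# Weights only; assumes no per-block quantization overhead.
weight_bytes = NUM_PARAMS * BITS_PER_WEIGHT / 8
weight_gib = weight_bytes / 2**30

print(f"about {samples_per_second:.2f} samples/s ({seconds_per_sample:.1f} s per sample)")
print(f"about {weight_gib:.2f} GiB for the 4-bit weights")
```

Roughly two seconds per sample at batch size 1, and about 1.4 GiB of quantized weights, which leaves most of the RTX 3090's 24 GB for activations and generation during training.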

Limitations and Recommendations

Educational Use Only: Not intended for direct medical advice, diagnosis, or treatment recommendations.
Potential for Inaccuracy: May generate incorrect or misleading medical information.
Human Oversight Required: Not suitable for high-stakes medical decision-making without professional human oversight.
Bias Awareness: Users should be aware of potential biases inherited from the training data and verify outputs with medical professionals.