Model Overview
mkurman/Llama-3.2-MedIT-3B-R1 is a 3B-parameter model fine-tuned from meta-llama/Llama-3.2-3B-Instruct. It supports a 32,768-token context length and was developed with a multi-stage training methodology: Blurred Thoughts Supervised Fine-Tuning (BT-SFT) on the open-thoughts/OpenThoughts-114k dataset, followed by two stages of Group Relative Policy Optimization (GRPO) guided by an LLM evaluator, trained on the FreedomIntelligence/medical-o1-verifiable-problem and open-r1/OpenR1-Math-220k datasets, respectively. This specialized training targets stronger reasoning and problem solving in medical and mathematical domains.
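As a minimal usage sketch, the model can be loaded and queried like any Llama-3.2-based instruct model via the standard Hugging Face transformers chat workflow. The prompt, dtype, and generation settings below are illustrative assumptions, not values taken from the model card:

```python
# Minimal usage sketch; assumes the standard `transformers` chat workflow.
# The sample question and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mkurman/Llama-3.2-MedIT-3B-R1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 3B model fits comfortably on one modern GPU
    device_map="auto",
)

# A hypothetical verifiable medical-style question, in the spirit of the
# GRPO training data; not drawn from the actual dataset.
messages = [
    {"role": "user",
     "content": "A patient weighs 70 kg. What is the total dose of a drug "
                "prescribed at 5 mg/kg?"}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```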
Key Capabilities
- Advanced Fine-Tuning Research: Demonstrates sophisticated training methods such as BT-SFT and GRPO with an LLM evaluator (see the sketch after this list).
- Specialized Reasoning: Enhanced for tasks involving verifiable medical problems and mathematical reasoning through targeted dataset training.
- Long Context: Processes long inputs with a 32,768-token context window.
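GRPO scores a group of sampled completions for the same prompt and normalizes each reward against the group's statistics, so no separate value model is needed. The sketch below illustrates only that group-relative advantage step; the reward values stand in for scores an LLM evaluator might assign, and none of this is the authors' actual training code:

```python
# Illustrative sketch of the group-relative advantage at the heart of GRPO.
# Rewards stand in for LLM-evaluator scores over completions of one prompt.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against its group mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. four completions for one medical problem, judged on a 0..1 scale
print(group_relative_advantages([0.9, 0.2, 0.7, 0.2]))
```

Completions scoring above their group's mean receive positive advantages and are reinforced; those below are penalized, which is what makes the optimization "group relative."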
Good for
- Academic Research: Ideal for investigating advanced fine-tuning techniques and evaluating model performance in task-oriented conversational scenarios.
- Experimental Applications: Suitable for exploratory projects in controlled environments, particularly in medical and mathematical domains.
- Methodology Evaluation: Useful for researchers studying the impact of multi-stage training and LLM-based evaluation on model capabilities.