Model Overview
Phonsiri/gemma-2-2b-Distillation-gemma-2-27b-it, also known as Gemma-2-2B Reasoning Edition (GRPO), is a specialized 2.6 billion parameter model built upon Google's gemma-2-2b-it. Developed by Phonsiri and CYP777, this model is engineered for structured mathematical reasoning and logic tasks, explicitly showing its work through a chain-of-thought process.
Key Capabilities
- Step-by-step Reasoning: Unlike typical instruction-tuned models, this model is trained to output detailed reasoning steps within <reasoning> tags before providing a final answer in <answer> tags or \boxed{} format.
- Enhanced Mathematical & Logic Problem Solving: Its training methodology, including reinforcement learning (GRPO) and knowledge distillation from the larger google/gemma-2-27b-it, significantly boosts its performance on analytical problems.
- Structured Output: Adheres to a specific XML-like output format for reasoning and answers, making its outputs parseable and verifiable.
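Because the output format is fixed, responses can be consumed programmatically. Below is a minimal sketch of extracting the reasoning trace and final answer from a response string; the sample text is illustrative, not actual model output.

```python
import re

def parse_response(text: str) -> dict:
    """Extract the reasoning trace and final answer from a model response.

    Looks for <reasoning>...</reasoning> and <answer>...</answer> blocks,
    falling back to a \\boxed{...} expression if no <answer> tag is present.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        answer = re.search(r"\\boxed\{(.*?)\}", text)
    return {
        "reasoning": reasoning.group(1).strip() if reasoning else None,
        "answer": answer.group(1).strip() if answer else None,
    }

# Illustrative response following the documented format (not real model output).
sample = (
    "<reasoning>2 + 2 means adding two and two, "
    "which gives four.</reasoning>\n<answer>4</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"])  # → 4
```

The fallback on \boxed{} mirrors the two answer formats the model is trained to emit, so a downstream verifier can check either style with the same helper.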
Training Methodology
The model underwent a two-stage training process:
- Supervised Fine-Tuning (SFT): Initial fine-tuning on open-r1/OpenR1-Math-220k to teach the model the reasoning output syntax.
- GRPO (Group Relative Policy Optimization): Subsequent RL training with a custom reward system that incentivizes both correct mathematical answers and strict adherence to the <reasoning> XML formatting.
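A reward system of this kind can be sketched as below. This is a hypothetical illustration of combining a correctness reward with a format-adherence reward, not the authors' actual reward code; the function names and weights are assumptions.

```python
import re

# Completion must be exactly one <reasoning> block followed by one <answer> block.
FORMAT_PATTERN = re.compile(
    r"^<reasoning>.+?</reasoning>\s*<answer>.+?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the required XML layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """2.0 if the extracted <answer> matches the reference answer, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 2.0 if m and m.group(1).strip() == gold_answer else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # GRPO turns these scalar rewards into group-relative advantages by
    # comparing completions sampled for the same prompt.
    return format_reward(completion) + correctness_reward(completion, gold_answer)

good = "<reasoning>3 * 7 = 21</reasoning>\n<answer>21</answer>"
bad = "The answer is 21."
print(total_reward(good, "21"))  # → 3.0
print(total_reward(bad, "21"))   # → 0.0
```

Weighting correctness above formatting (2.0 vs. 1.0 here) reflects the card's stated goal: the answer must be right, but a well-formed reasoning trace is still rewarded on its own.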
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Transparent and verifiable solutions to mathematical problems.
- Educational tools that demonstrate problem-solving steps.
- Automated systems needing logical deduction and analytical processing.