Phonsiri/gemma-2-2b-Distillation-gemma-2-27b-it

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 2.6B · Quant: BF16 · Context Length: 8k · Published: Mar 3, 2026 · License: Gemma · Architecture: Transformer

Phonsiri/gemma-2-2b-Distillation-gemma-2-27b-it is a 2.6 billion parameter Gemma 2 model developed by Phonsiri and CYP777, fine-tuned for structured mathematical reasoning and logic tasks. This model utilizes Reinforcement Learning (GRPO) and knowledge distillation from the larger gemma-2-27b-it to explicitly generate step-by-step reasoning. It excels at analytical problem-solving by outputting detailed thought processes before providing a final answer, making it suitable for applications requiring transparent and verifiable computations.


Model Overview

Phonsiri/gemma-2-2b-Distillation-gemma-2-27b-it, also known as Gemma-2-2B Reasoning Edition (GRPO), is a specialized 2.6 billion parameter model built upon Google's gemma-2-2b-it. Developed by Phonsiri and CYP777, this model is uniquely engineered to perform structured mathematical reasoning and logic tasks by explicitly showing its work through a chain-of-thought process.

Key Capabilities

  • Step-by-step Reasoning: Unlike typical instruction-tuned models, this model is trained to output detailed reasoning steps within <reasoning> tags before providing a final answer in <answer> tags or \boxed{} format.
  • Enhanced Mathematical & Logic Problem Solving: Its training methodology, including Reinforcement Learning (GRPO) and knowledge distillation from the larger google/gemma-2-27b-it, significantly boosts its performance on analytical problems.
  • Structured Output: Adheres to a specific XML-like output format for reasoning and answers, making its outputs parseable and verifiable.
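Because the output format is fixed, downstream code can extract the reasoning trace and final answer mechanically. A minimal parsing sketch (the helper name and the `\boxed{}` fallback handling are illustrative, not part of the model's tooling):

```python
import re

def parse_model_output(text):
    """Extract the reasoning trace and final answer from the model's
    structured output. Falls back to LaTeX \\boxed{...} when no
    <answer> tag is present."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        # Alternate format documented above: answer inside \boxed{}.
        answer = re.search(r"\\boxed\{(.*?)\}", text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )

sample = (
    "<reasoning>2 apples plus 3 apples gives 5 apples.</reasoning>\n"
    "<answer>5</answer>"
)
print(parse_model_output(sample))
# ('2 apples plus 3 apples gives 5 apples.', '5')
```

Either component may be `None` for malformed generations, so callers should check before use.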

Training Methodology

The model underwent a two-stage training process:

  1. Supervised Fine-Tuning (SFT): Initial fine-tuning on open-r1/OpenR1-Math-220k to teach the model the structure of reasoning syntax.
  2. GRPO (Group Relative Policy Optimization): Subsequent RL training with a custom reward system that incentivizes both correct mathematical answers and strict adherence to the <reasoning> XML formatting.
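A reward combining correctness and format adherence, as described in stage 2, might look like the following toy sketch. The function name, weights, and exact format check are assumptions for illustration; the model card does not publish the actual reward code.

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Toy GRPO-style reward: one component for strict
    <reasoning>/<answer> formatting, one for answer correctness.
    Weights (0.5 / 1.0) are hypothetical."""
    format_ok = bool(re.fullmatch(
        r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*",
        completion,
        re.DOTALL,
    ))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    correct = match is not None and match.group(1).strip() == gold_answer
    return 0.5 * format_ok + 1.0 * correct

good = "<reasoning>6 * 7 = 42</reasoning><answer>42</answer>"
bad = "The answer is 42."
print(reward(good, "42"))  # 1.5
print(reward(bad, "42"))   # 0.0
```

Rewarding the format separately from correctness is what pushes the policy to always emit the parseable tag structure, even on problems it gets wrong.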

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Transparent and verifiable solutions to mathematical problems.
  • Educational tools that demonstrate problem-solving steps.
  • Automated systems needing logical deduction and analytical processing.