KickItLikeShika/Qwen2.5-1.5B-Instruct-SFT-GRPO-GSM8K
The KickItLikeShika/Qwen2.5-1.5B-Instruct-SFT-GRPO-GSM8K is a 1.5-billion-parameter instruction-tuned Qwen2.5 model developed by KickItLikeShika. It is optimized for grade-school mathematical reasoning (the GSM8K domain), generating structured outputs with a reasoning scratchpad followed by a final numerical answer. By explicitly detailing its reasoning process, the model suits applications that require transparent mathematical problem-solving.
Model Overview
The KickItLikeShika/Qwen2.5-1.5B-Instruct-SFT-GRPO-GSM8K is a 1.5-billion-parameter Qwen2.5-based instruction-tuned model, engineered for grade-school mathematical reasoning. Its primary distinction is producing structured outputs for math problems: a detailed scratchpad inside <reasoning>…</reasoning> tags followed by a single numerical answer inside <answer>…</answer> tags.
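Because the completion format is tagged, downstream code can extract both parts mechanically. A minimal parsing sketch, assuming the tags appear exactly as described above:

```python
import re

def parse_solution(text: str):
    """Extract the scratchpad and final answer from a model completion.

    Assumes the tagged format described above:
    <reasoning>...</reasoning> followed by <answer>...</answer>.
    Returns (reasoning, answer), or (None, None) if either tag is missing.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not (reasoning and answer):
        return None, None
    return reasoning.group(1).strip(), answer.group(1).strip()

# Illustrative completion (not actual model output):
completion = (
    "<reasoning>Each box holds 6 eggs and there are 3 boxes, "
    "so 6 * 3 = 18 eggs.</reasoning>\n<answer>18</answer>"
)
steps, final = parse_solution(completion)
print(final)  # -> 18
```

Returning `None` on malformed output makes it easy to count how often the model breaks format, which is useful when evaluating on GSM8K.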
Key Capabilities
- Structured Mathematical Reasoning: Generates explicit step-by-step reasoning processes for math problems.
- Grade-Level Math Proficiency: Fine-tuned to solve mathematical problems typically found at the grade school level.
- Instruction-Tuned: Optimized to follow instructions for generating structured math solutions.
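The card does not publish the exact prompt used during fine-tuning, but a chat-style prompt that elicits the tagged format can be built as follows; the system-prompt wording here is an assumption, not the model's documented training prompt:

```python
# Hypothetical system prompt; the exact wording used during SFT/GRPO
# is not stated in this model card.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>"
)

def build_messages(question: str) -> list[dict]:
    """Build a chat-format message list, suitable for passing to
    tokenizer.apply_chat_template(...) in the transformers library."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

messages = build_messages("A box holds 6 eggs. How many eggs are in 3 boxes?")
```

The resulting `messages` list can be fed through the tokenizer's chat template and generated from as with any Qwen2.5 instruct model.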
Training Methodology
The model underwent a two-stage training process:
- LoRA SFT: Initial fine-tuning using Low-Rank Adaptation (LoRA) on 100 randomly sampled GSM8K training examples. This stage taught the model the desired output format and roughly sensible reasoning traces.
- GRPO: Subsequent training using Group Relative Policy Optimization (GRPO) for 2,000 steps, building upon the initial LoRA adapter.
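GRPO optimizes the policy against scalar rewards computed over groups of sampled completions. The card does not state which rewards were used; a common GSM8K setup combines a format reward with an answer-correctness reward, sketched below (the reward definitions and weights are assumptions, following the per-completion reward-function convention used by libraries such as TRL's GRPOTrainer):

```python
import re

def format_reward(completion: str) -> float:
    """Small reward for following the <reasoning>/<answer> template."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Larger reward when the <answer> matches the GSM8K gold answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m and m.group(1).strip() == gold_answer.strip():
        return 2.0
    return 0.0

def reward(completion: str, gold_answer: str) -> float:
    # Hypothetical combined reward; actual weights used in training unknown.
    return format_reward(completion) + correctness_reward(completion, gold_answer)

print(reward("<reasoning>3 * 6 = 18</reasoning><answer>18</answer>", "18"))  # 2.5
```

Separating the format and correctness terms lets the model first earn partial reward for emitting well-formed tags, then full reward for correct answers, which stabilizes early RL training.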
Good For
- Applications requiring transparent and verifiable mathematical problem-solving.
- Educational tools that need to show step-by-step solutions to math problems.
- Scenarios where a small, specialized model for math reasoning is preferred over larger, general-purpose LLMs.