youngzhong/SOD-GRPO_teacher-4B
youngzhong/SOD-GRPO_teacher-4B is a 4 billion parameter agentic reasoning model developed by youngzhong, based on Qwen3-4B. It is trained with Group Relative Policy Optimization (GRPO) and serves as a teacher model within the SOD distillation framework. This model is specifically designed to distill smaller student models for tool-integrated reasoning, demonstrating strong performance on challenging math, science, and code benchmarks.
Loading preview...
Model Overview
SOD-GRPO_teacher-4B is a 4 billion parameter agentic reasoning model developed by youngzhong, built upon the Qwen3-4B base model. It is trained using Group Relative Policy Optimization (GRPO), a method designed to enhance agentic reasoning capabilities. This model's primary role is to act as a teacher within the SOD (Step-wise On-policy Distillation) framework, facilitating the distillation of knowledge to smaller student models like SOD-0.6B and SOD-1.7B.
Key Capabilities & Purpose
- Agentic Reasoning: Optimized for complex reasoning tasks, particularly those involving tool integration.
- Teacher Model for Distillation: Serves as the high-performing source for distilling smaller, more efficient student models using the SOD method.
- Enhanced Performance: Achieves strong results on challenging benchmarks, including AIME 2024 (67.60), AIME 2025 (60.42), GPQA-Diamond (55.19), and LiveCodeBench-v6 (63.13), with an average score of 61.59.
When to Use This Model
This model is ideal for researchers and developers focused on:
- Developing Smaller Agentic Models: If your goal is to create compact yet capable agentic models, SOD-GRPO_teacher-4B provides a robust teacher for distillation.
- Research in Agentic Reasoning & Distillation: It's a valuable resource for exploring advanced techniques like GRPO and SOD for improving LLM agents.
- Benchmarking Agentic Performance: Its reported performance on demanding math, science, and code tasks makes it a strong baseline for comparison.