youngzhong/SOD-0.6B
youngzhong/SOD-0.6B is a 0.6 billion parameter student language model developed by youngzhong, distilled from a 4B teacher model using Step-wise On-policy Distillation (SOD). This method is specifically designed to train small language model agents with enhanced tool-integrated reasoning capabilities, addressing cascading error propagation in on-policy distillation. Built upon Qwen3-0.6B, it excels in challenging math, science, and code benchmarks, demonstrating strong performance for agentic tasks.
Loading preview...
Overview
youngzhong/SOD-0.6B is a 0.6 billion parameter student model that has been distilled from a 4 billion parameter teacher model using a novel technique called Step-wise On-policy Distillation (SOD). This method is specifically engineered to train small language model agents, focusing on improving their tool-integrated reasoning abilities. SOD tackles the common issue of cascading error propagation in on-policy distillation by employing an adaptive step-level weighting mechanism. This mechanism effectively suppresses distillation loss on drifted steps and restores supervision when the student model realigns, all with minimal additional computational overhead.
Key Capabilities & Features
- Agentic Reasoning: Optimized for tasks requiring tool-integrated reasoning, making it suitable for agent-based applications.
- Efficient Distillation: Utilizes SOD to create a highly capable small model (0.6B parameters) from a larger teacher (4B parameters) without significant computational cost.
- Error Mitigation: The SOD method specifically addresses and reduces cascading error propagation during on-policy distillation.
- Strong Performance: Achieves notable results on challenging benchmarks, including AIME, GPQA-Diamond, and LiveCodeBench-v6.
Performance Highlights
This model demonstrates significant performance gains over other 0.6B baselines, particularly in complex reasoning tasks:
- Achieves 26.13% on AIME 2025 (average@32).
- Shows an average improvement of +20.86% over the second-best baseline (OPD) across evaluated benchmarks.
Good For
- Developing small, efficient language model agents.
- Applications requiring tool-integrated reasoning where model size is a constraint.
- Tasks involving complex math, science, and code problem-solving.