youngzhong/SOD-GRPO_teacher-4B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:May 12, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

youngzhong/SOD-GRPO_teacher-4B is a 4 billion parameter agentic reasoning model developed by youngzhong, based on Qwen3-4B. It is trained with Group Relative Policy Optimization (GRPO) and serves as a teacher model within the SOD distillation framework. This model is specifically designed to distill smaller student models for tool-integrated reasoning, demonstrating strong performance on challenging math, science, and code benchmarks.

Loading preview...

Model Overview

SOD-GRPO_teacher-4B is a 4 billion parameter agentic reasoning model developed by youngzhong, built upon the Qwen3-4B base model. It is trained using Group Relative Policy Optimization (GRPO), a method designed to enhance agentic reasoning capabilities. This model's primary role is to act as a teacher within the SOD (Step-wise On-policy Distillation) framework, facilitating the distillation of knowledge to smaller student models like SOD-0.6B and SOD-1.7B.

Key Capabilities & Purpose

  • Agentic Reasoning: Optimized for complex reasoning tasks, particularly those involving tool integration.
  • Teacher Model for Distillation: Serves as the high-performing source for distilling smaller, more efficient student models using the SOD method.
  • Enhanced Performance: Achieves strong results on challenging benchmarks, including AIME 2024 (67.60), AIME 2025 (60.42), GPQA-Diamond (55.19), and LiveCodeBench-v6 (63.13), with an average score of 61.59.

When to Use This Model

This model is ideal for researchers and developers focused on:

  • Developing Smaller Agentic Models: If your goal is to create compact yet capable agentic models, SOD-GRPO_teacher-4B provides a robust teacher for distillation.
  • Research in Agentic Reasoning & Distillation: It's a valuable resource for exploring advanced techniques like GRPO and SOD for improving LLM agents.
  • Benchmarking Agentic Performance: Its reported performance on demanding math, science, and code tasks makes it a strong baseline for comparison.