GT-Qwen3-4B-Base-DAPO14k: Reasoning with Co-rewarding
This model is a 4-billion-parameter Qwen3-Base variant from the GT-GRPO line, fine-tuned on the DAPO-14k dataset. Its core innovation is Co-rewarding, a self-supervised reinforcement learning (RL) framework designed to elicit and improve reasoning capabilities in large language models (LLMs).
Key Capabilities & Features
- Enhanced Reasoning: Specifically trained to improve performance on complex reasoning tasks, particularly in mathematics.
- Self-supervised RL: Utilizes the Co-rewarding framework, which avoids the need for extensive human-annotated labels, addressing the scaling dilemma of traditional RL with verifiable rewards (RLVR).
- Training Stability: Co-rewarding introduces complementary supervision views (data-side Co-rewarding-I and model-side Co-rewarding-II) to mitigate training collapse and reward hacking issues common in other self-rewarding methods.
- Competitive Performance: Empirically demonstrates stable training and outperforms other self-rewarding baselines, with significant gains on mathematical reasoning benchmarks; in some cases it even surpasses RLVR trained with ground-truth labels.
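The self-supervision idea behind methods like Co-rewarding can be illustrated with a minimal sketch: instead of checking rollouts against a human-annotated answer, reward each sampled response by its agreement with a pseudo-label derived from the model's own outputs (here, a simple majority vote over final answers). This is an illustrative simplification under stated assumptions, not the repository's actual implementation; the function name and binary reward scheme are hypothetical.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Illustrative self-supervised reward: score each sampled answer
    by agreement with the majority-vote pseudo-label (a sketch, not
    the Co-rewarding repo's API)."""
    # Pseudo-label: the most frequent final answer among the rollouts.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    # Binary reward: 1.0 if a rollout matches the pseudo-label, else 0.0.
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# Example: four sampled answers to the same math question.
rewards = majority_vote_reward(["42", "42", "41", "42"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0]
```

A known failure mode of naive majority-vote rewards is collapse, where the model converges to trivially consistent outputs; the complementary data-side and model-side supervision views described above are what Co-rewarding introduces to counteract this.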
Good For
- Applications requiring strong mathematical and complex reasoning capabilities.
- Research and development in self-supervised reinforcement learning for LLMs.
- Scenarios where reducing reliance on human-annotated labels for reasoning task fine-tuning is critical.
For more in-depth information on the Co-rewarding framework, including code and datasets, refer to the official GitHub Repository and the associated paper.