GT-Qwen3-4B-Base-DAPO14k: Reasoning with Co-rewarding
This model is a 4-billion-parameter Qwen3-Base variant from the GT-GRPO line, fine-tuned on the DAPO-14k dataset. Its core innovation is Co-rewarding, a self-supervised reinforcement learning (RL) framework designed to elicit and improve reasoning capabilities in large language models (LLMs).
Key Capabilities & Features
- Enhanced Reasoning: Specifically trained to improve performance on complex reasoning tasks, particularly in mathematics.
- Self-supervised RL: Utilizes the Co-rewarding framework, which avoids the need for extensive human-annotated labels, addressing the scaling dilemma of traditional RL with verifiable rewards (RLVR).
- Training Stability: Co-rewarding introduces complementary supervision views (data-side Co-rewarding-I and model-side Co-rewarding-II) to mitigate training collapse and reward hacking issues common in other self-rewarding methods.
- Competitive Performance: Empirically demonstrates stable training and outperforms other self-rewarding baselines, with significant gains on mathematical reasoning benchmarks; in some cases it even surpasses RLVR trained with ground-truth labels.
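The self-supervision idea behind methods like Co-rewarding can be illustrated with a minimal sketch: instead of checking rollouts against a human-annotated answer, reward each sampled response by its agreement with a pseudo-label derived from the model's own outputs (here, a simple majority vote over final answers). This is an illustrative simplification under stated assumptions, not the repository's actual implementation; the function name and binary reward scheme are hypothetical.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Illustrative self-supervised reward: score each sampled answer
    by agreement with the majority-vote pseudo-label (a sketch, not
    the Co-rewarding repo's API)."""
    # Pseudo-label: the most frequent final answer among the rollouts.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    # Binary reward: 1.0 if a rollout matches the pseudo-label, else 0.0.
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# Example: four sampled answers to the same math question.
rewards = majority_vote_reward(["42", "42", "41", "42"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0]
```

A known failure mode of naive majority-vote rewards is collapse, where the model converges to trivially consistent outputs; the complementary data-side and model-side supervision views described above are what Co-rewarding introduces to counteract this.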
Good For
- Applications requiring strong mathematical and complex reasoning capabilities.
- Research and development in self-supervised reinforcement learning for LLMs.
- Scenarios where reducing reliance on human-annotated labels for reasoning task fine-tuning is critical.
For more in-depth information on the Co-rewarding framework, including code and datasets, refer to the official GitHub Repository and the associated paper.