TMLR-Group-HF/GT-Qwen3-4B-Base-DAPO14k

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Oct 3, 2025 · License: MIT · Architecture: Transformer · Open weights

TMLR-Group-HF's GT-Qwen3-4B-Base-DAPO14k is a 4-billion-parameter Qwen3-Base model fine-tuned on the DAPO-14k dataset with Co-rewarding, a novel self-supervised reinforcement learning framework. The model is optimized to strengthen reasoning in large language models, particularly on complex mathematical tasks. It supports a context length of 40,960 tokens and aims to improve training stability and reasoning-benchmark performance without relying on extensive human-annotated labels.


GT-Qwen3-4B-Base-DAPO14k: Reasoning with Co-rewarding

This model is a 4-billion-parameter Qwen3-Base variant released by TMLR-Group-HF, fine-tuned on the DAPO-14k dataset. Its core innovation is Co-rewarding, a novel self-supervised reinforcement learning (RL) framework designed to elicit and improve reasoning capabilities in large language models (LLMs).
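
A minimal way to try the model is via the Hugging Face transformers library. The snippet below is an illustrative sketch, assuming the checkpoint ID shown on this card and standard causal-LM loading; it is not taken from the official repository.

```python
# Minimal inference sketch using Hugging Face transformers.
# Assumes the checkpoint ID below matches this card; adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/GT-Qwen3-4B-Base-DAPO14k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # card lists BF16 weights
    device_map="auto",
)

prompt = "Solve step by step: if 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

BF16 loading matches the quantization listed above; swap the dtype or device_map settings to suit your hardware.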

Key Capabilities & Features

  • Enhanced Reasoning: Specifically trained to improve performance on complex reasoning tasks, particularly in mathematics.
  • Self-supervised RL: Utilizes the Co-rewarding framework, which avoids the need for extensive human-annotated labels, addressing the scaling dilemma of traditional RL with verifiable rewards (RLVR).
  • Training Stability: Co-rewarding introduces complementary supervision views (data-side Co-rewarding-I and model-side Co-rewarding-II) to mitigate the training collapse and reward hacking common in other self-rewarding methods (see the sketch after this list).
  • Competitive Performance: Trains stably and outperforms other self-rewarding baselines, with significant gains on mathematical reasoning benchmarks; in some cases it even surpasses RLVR with ground-truth labels.
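
To make the data-side idea concrete, here is a minimal sketch of an agreement-style pseudo reward in the spirit of Co-rewarding-I: rollouts on the original question are scored against the majority answer obtained on a paraphrased view of the same question. The helper names (extract_answer, agreement_reward) and the majority-vote rule are illustrative assumptions, not the authors' implementation; see the official repository for the actual objective.

```python
# Illustrative sketch of a data-side agreement reward in the spirit of
# Co-rewarding-I. All helpers here are hypothetical, for illustration only.
from collections import Counter

def extract_answer(completion: str) -> str:
    """Hypothetical helper: pull the final answer out of a completion,
    e.g. the text after a '####' marker (GSM8K-style)."""
    return completion.split("####")[-1].strip()

def agreement_reward(orig_completions: list[str],
                     para_completions: list[str]) -> list[float]:
    """Score each rollout on the original question by whether its answer
    matches the majority answer from a paraphrased view of the same
    question -- a self-supervised stand-in for a ground-truth label."""
    para_answers = [extract_answer(c) for c in para_completions]
    pseudo_label, _ = Counter(para_answers).most_common(1)[0]
    return [1.0 if extract_answer(c) == pseudo_label else 0.0
            for c in orig_completions]
```

Deriving the pseudo label from a different view of the question, rather than from the scored rollouts themselves, is intended to guard against the degenerate agreement that plain self-rewarding can collapse into; Co-rewarding-II instead takes its supervision signal from the model side, as noted in the bullets above.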

Good For

  • Applications requiring strong mathematical and complex reasoning capabilities.
  • Research and development in self-supervised reinforcement learning for LLMs.
  • Scenarios where reducing reliance on human-annotated labels for reasoning task fine-tuning is critical.

For more in-depth information on the Co-rewarding framework, including code and datasets, refer to the official GitHub Repository and the associated paper.