TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Oct 3, 2025 | License: mit | Architecture: Transformer | Open Weights | Warm

TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter model built on Qwen3-8B-Base, released by TMLR-Group-HF and fine-tuned on the DAPO-14k dataset. It is trained with Co-rewarding-I, a self-supervised reinforcement learning framework intended to strengthen reasoning. The framework targets more stable training and better performance on complex reasoning tasks, such as mathematical reasoning, by mitigating training collapse and reward hacking.


Model Overview

TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter model based on the Qwen3-8B-Base architecture, released by TMLR-Group-HF and fine-tuned on the DAPO-14k dataset. Its core innovation is the training methodology: the Co-rewarding-I framework.

Key Capabilities

  • Enhanced Reasoning: The fine-tuning is aimed specifically at improving the reasoning ability of the underlying large language model (LLM).
  • Stable Self-supervised RL: It is trained with Co-rewarding-I, the data-side instantiation of a self-supervised reinforcement learning (RL) framework. This approach aims to make training more stable than single-view self-rewarding methods.
  • Mitigation of Training Issues: Co-rewarding-I addresses common problems like training collapse and reward hacking by deriving reward signals from contrastive agreement across semantically analogous questions.
  • Mathematical Reasoning: The framework is particularly effective for complex challenges such as mathematical reasoning.
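The reward idea described in the bullets above can be illustrated with a minimal sketch. It assumes that "agreement" is measured by majority vote over final answers: the answers sampled for an original question are rewarded when they match the majority answer sampled for a semantically analogous (rephrased) question, and vice versa. The function names and the majority-vote rule are assumptions made for illustration; the text above does not specify them, and this is not the authors' implementation.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent final answer among sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]


def cross_view_rewards(original_answers, rephrased_answers):
    """Reward each rollout by agreement with the OTHER view's majority answer.

    original_answers:  final answers sampled for the original question.
    rephrased_answers: final answers sampled for an analogous question.
    Both helpers and the 0/1 reward scale are hypothetical.
    """
    pseudo_from_original = majority_answer(original_answers)
    pseudo_from_rephrased = majority_answer(rephrased_answers)

    # Original-question rollouts are scored against the rephrased view's label.
    rewards_original = [
        1.0 if a == pseudo_from_rephrased else 0.0 for a in original_answers
    ]
    # Rephrased-question rollouts are scored against the original view's label.
    rewards_rephrased = [
        1.0 if a == pseudo_from_original else 0.0 for a in rephrased_answers
    ]
    return rewards_original, rewards_rephrased


if __name__ == "__main__":
    r_orig, r_reph = cross_view_rewards(["42", "42", "41", "42"], ["42", "40", "42"])
    print(r_orig, r_reph)
```

Because each view is scored against a label produced by the other view, a rollout cannot earn reward simply by agreeing with its own group's consensus, which is one plausible reading of how cross-question agreement discourages reward hacking.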

What Makes This Model Different?

This model stands out for its Co-rewarding-I training framework, detailed in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" [https://huggingface.co/papers/2508.00410]. Rather than relying solely on supervised fine-tuning or single-view self-rewarding RL, its training uses a self-supervised RL approach built to elicit and stabilize reasoning. This makes it a natural candidate for applications that need reliable, consistent reasoning performance.
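For trying the model on a reasoning prompt, a loading sketch with the Hugging Face `transformers` library might look like the following. It assumes the checkpoint loads through the standard `AutoModelForCausalLM` and `AutoTokenizer` classes, that `accelerate` is installed for `device_map="auto"`, and that a plain-text prompt is appropriate for a base model; the prompt wording and the helper names `build_prompt` and `generate` are hypothetical and not taken from the model card.

```python
MODEL_ID = "TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k"


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; no chat template is assumed."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Load the checkpoint and generate a continuation for one question."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # requires the accelerate package
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is the sum of the first 10 positive integers?"))
```

Generation settings such as `max_new_tokens` are placeholders; long mathematical derivations may need a larger budget within the model's context length.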