TMLR-Group-HF/Co-rewarding-III-Qwen3-8B-Base-DAPO14k

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Dec 21, 2025 | License: MIT | Architecture: Transformer | Open Weights | Warm

TMLR-Group-HF/Co-rewarding-III-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter model developed by TMLR-Group on top of Qwen3-8B-Base. It is fine-tuned with the Co-rewarding-III method on the DAPO14k dataset, using stable self-supervised reinforcement learning to elicit and strengthen reasoning. This makes it a candidate for tasks that require multi-step logical inference.


Overview

Co-rewarding-III-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter model from TMLR-Group, fine-tuned from Qwen3-8B-Base. What sets it apart is the fine-tuning approach: the Co-rewarding-III method applied to the DAPO14k training set. The method is described in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" (arXiv:2508.00410).

Key Capabilities

  • Enhanced Reasoning: Trained with a stable self-supervised reinforcement learning framework whose goal is to elicit and improve the model's reasoning ability.
  • Co-rewarding Framework: Fine-tuned with the Co-rewarding-III method described in the paper cited in the Overview.
  • Qwen3-Base Architecture: Built on Qwen3-8B-Base, which serves as the starting checkpoint for the specialized fine-tuning.
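
To try these capabilities locally, here is a minimal sketch of loading the checkpoint with the Hugging Face transformers library. It assumes a transformers version recent enough to support Qwen3, the accelerate package for device_map="auto", and enough GPU memory for an 8B model. The prompt wording in build_prompt and the helper names are hypothetical: this card does not specify a prompt template, so treat the format as a starting point rather than the authors' setup.

```python
MODEL_ID = "TMLR-Group-HF/Co-rewarding-III-Qwen3-8B-Base-DAPO14k"


def build_prompt(question: str) -> str:
    """Build a plain-text reasoning prompt.

    Assumption: the wording below is illustrative only; the model card
    does not document a required prompt template.
    """
    return (
        f"Question: {question.strip()}\n"
        "Please reason step by step, and put your final answer within \\boxed{}.\n"
        "Answer:"
    )


def load_model():
    """Load tokenizer and model. Imports are lazy so build_prompt works without them."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # requires the accelerate package
    )
    return tokenizer, model


def generate_answer(tokenizer, model, question: str, max_new_tokens: int = 1024) -> str:
    """Greedy-decode a completion and return only the newly generated text."""
    import torch

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the completion is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example usage (downloads the weights on first call):
#   tokenizer, model = load_model()
#   print(generate_answer(tokenizer, model, "What is 17 * 23?"))
```

Greedy decoding is used here only to keep runs reproducible; sampling settings are a choice left to the reader.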

Good For

  • Research in Self-supervised RL: Ideal for researchers exploring stable self-supervised reinforcement learning techniques for LLMs.
  • Reasoning-intensive Tasks: Suitable for applications requiring advanced logical inference and problem-solving from language models.
  • Benchmarking Reasoning: Can be used as a baseline or comparison model for evaluating reasoning performance in LLMs.
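
For the benchmarking use case, a small scoring helper can be sketched as follows. It assumes completions end with a final answer wrapped in \boxed{...} and that correctness is judged by exact string match against a reference; both are assumptions made for illustration, not the evaluation protocol of the Co-rewarding paper. The function names extract_boxed and exact_match_accuracy are hypothetical.

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # we are already inside the opening brace
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never closed


def exact_match_accuracy(completions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of completions whose boxed answer equals the reference string."""
    if len(completions) != len(references):
        raise ValueError("completions and references must have the same length")
    if not completions:
        return 0.0
    correct = sum(
        extract_boxed(c) == r.strip() for c, r in zip(completions, references)
    )
    return correct / len(completions)
```

Exact string match is deliberately strict: equivalent answers such as 0.5 and \frac{1}{2} count as mismatches, so a real comparison against other models would likely need answer normalization or a math-aware checker.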