TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Oct 3, 2025 · License: MIT · Architecture: Transformer · Open Weights

TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k is an 8 billion parameter Qwen3-Base model developed by TMLR-Group-HF, fine-tuned on the DAPO-14k dataset. This model leverages the Co-rewarding self-supervised reinforcement learning framework to significantly enhance reasoning abilities in large language models. It is specifically optimized for mathematical reasoning tasks, demonstrating improved stability and performance over other self-rewarding baselines. The model supports a context length of 32768 tokens.


Model Overview

TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k is an 8 billion parameter Qwen3-Base model, developed by TMLR-Group-HF, fine-tuned on the DAPO-14k dataset. The model comes out of research on Co-rewarding, a self-supervised reinforcement learning (RL) framework designed to improve the reasoning capabilities of large language models (LLMs). Co-rewarding's core innovation is to counter the training instability common in self-rewarding methods by drawing complementary supervision from multiple perspectives.
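
For a quick start, here is a minimal inference sketch using the Hugging Face transformers library. It assumes the checkpoint exposes a standard causal-LM interface; the dtype, device settings, and prompt are illustrative choices, not recommendations from the authors. Because this is a base-style model, a plain completion prompt is used rather than a chat template.

```python
# Minimal inference sketch; settings are illustrative, not authoritative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # pick a dtype your hardware supports
    device_map="auto",           # requires `accelerate`
)

# Base model: use a plain completion prompt rather than a chat template.
prompt = (
    "Question: A train travels 120 km in 1.5 hours. "
    "What is its average speed in km/h?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```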

Key Capabilities

  • Enhanced Reasoning: Specifically engineered to boost the reasoning abilities of LLMs, particularly in complex problem-solving scenarios.
  • Stable Self-supervised Learning: Employs the Co-rewarding framework, with data-side (Co-rewarding-I) and model-side (Co-rewarding-II) instantiations, for more stable training than conventional self-rewarding approaches (a toy sketch of the data-side idea follows this list).
  • Mathematical Reasoning: Demonstrates significant performance improvements on various mathematical reasoning benchmarks, often outperforming reinforcement learning with verifiable rewards (RLVR) methods that rely on ground-truth labels.
  • Large Context Window: Supports a substantial context length of 32768 tokens, allowing for processing and understanding longer inputs.
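
To make the data-side idea concrete, the toy sketch below implements a contrastive-agreement pseudo-reward: final answers sampled for a question and for a paraphrase of it vote on a pseudo-label, and each rollout is rewarded for matching it. This illustrates the general mechanism only; it is not the authors' implementation, and the function name and voting rule are hypothetical simplifications.

```python
from collections import Counter

def pseudo_reward(answers_original: list[str], answers_rephrased: list[str]) -> list[float]:
    """Toy co-rewarding-style signal (hypothetical, for illustration only).

    Final answers sampled for a question and for its paraphrase vote on a
    pseudo-label; each rollout on the original question earns 1.0 if it
    matches the cross-view majority answer, else 0.0. No ground-truth
    label is consulted at any point.
    """
    votes = Counter(answers_original + answers_rephrased)
    pseudo_label, _ = votes.most_common(1)[0]
    return [1.0 if answer == pseudo_label else 0.0 for answer in answers_original]

# Example: four rollouts per view of the same math question.
rewards = pseudo_reward(["8", "8", "6", "8"], ["8", "6", "8", "8"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0]
```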

Good For

  • Research in RL and LLMs: Ideal for researchers exploring advanced self-supervised learning techniques and reinforcement learning applications in language models.
  • Mathematical Problem Solving: Suited for applications requiring robust mathematical reasoning and logical deduction.
  • Benchmarking: Can be used as a strong baseline for evaluating new reasoning-focused LLM techniques (see the evaluation sketch after this list).
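
As one possible starting point for benchmarking, the sketch below runs the model on GSM8K via EleutherAI's lm-evaluation-harness. The harness is an assumption here (the model card does not prescribe an evaluation tool), the argument names follow its v0.4 Python API and may differ between versions, and GSM8K stands in for any mathematical reasoning benchmark.

```python
# Hypothetical evaluation sketch; requires `pip install lm-eval`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])  # per-task accuracy metrics
```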

For more technical details and the underlying research, refer to the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models and the official GitHub repository.