TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k
TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter Qwen3-Base model developed by TMLR-Group-HF and fine-tuned on the DAPO-14k dataset. The model uses the Co-rewarding self-supervised reinforcement learning framework to strengthen reasoning in large language models. It is optimized for mathematical reasoning tasks and shows improved training stability and performance over other self-rewarding baselines. The model supports a context length of 32768 tokens.
Model Overview
TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter Qwen3-Base model from TMLR-Group-HF, fine-tuned on the DAPO-14k dataset. The model is a direct result of research into Co-rewarding, a self-supervised reinforcement learning (RL) framework designed to improve the reasoning capabilities of large language models (LLMs). The core innovation of Co-rewarding is that it addresses the training instability common in self-rewarding methods by drawing complementary supervision from multiple perspectives.
Key Capabilities
- Enhanced Reasoning: Specifically engineered to boost the reasoning abilities of LLMs, particularly in complex problem-solving scenarios.
- Stable Self-supervised Learning: Employs the Co-rewarding framework, which includes data-side (Co-rewarding-I) and model-side (Co-rewarding-II) instantiations, to ensure more stable training compared to traditional self-rewarding approaches.
- Mathematical Reasoning: Demonstrates significant performance improvements on various mathematical reasoning benchmarks, often outperforming RLVR methods that rely on ground-truth labels.
- Large Context Window: Supports a context length of 32768 tokens, enough to accommodate long problem statements and extended multi-step solutions.
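As a minimal sketch of how the model might be loaded and prompted with the Hugging Face transformers library. The prompt template and generation settings below are illustrative assumptions, not values documented by the authors; consult the Co-rewarding repository for the exact setup used in training and evaluation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TMLR-Group-HF/GT-Qwen3-8B-Base-DAPO14k"
MAX_CONTEXT = 32768  # context length stated in the model card


def build_prompt(problem: str) -> str:
    """Wrap a math problem in a simple step-by-step reasoning prompt.

    This template is an assumption for illustration; the repository
    may use a different prompt format.
    """
    return (
        f"Problem: {problem}\n"
        "Please reason step by step and put your final answer in \\boxed{}.\n"
    )


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    prompt = build_prompt("What is the sum of the first 100 positive integers?")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Stay within the 32768-token context window.
    assert inputs["input_ids"].shape[1] <= MAX_CONTEXT

    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is chosen here only to make runs reproducible; sampling settings for best benchmark performance would need to match the paper's evaluation protocol.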
Good For
- Research in RL and LLMs: Ideal for researchers exploring advanced self-supervised learning techniques and reinforcement learning applications in language models.
- Mathematical Problem Solving: Suited for applications requiring robust mathematical reasoning and logical deduction.
- Benchmarking: Can be used as a strong baseline for evaluating new reasoning-focused LLM techniques.
For more technical details and the underlying research, refer to the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models and the official GitHub repository.