TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH
CoReward-Qwen3-8B-Base, developed by TMLR-Group-HF, is an 8-billion-parameter model based on Qwen3-8B-Base and fine-tuned with the Co-rewarding self-supervised reinforcement learning method; this checkpoint corresponds to the Co-rewarding-I instantiation trained on the MATH training set. It is optimized for mathematical reasoning in large language models, achieving stable training and strong performance on standard math benchmarks, and in some cases it surpasses RL with ground-truth labels, for example reaching 94.01% Pass@1 on GSM8K.
CoReward-Qwen3-8B-Base: Enhanced Mathematical Reasoning
CoReward-Qwen3-8B-Base is an 8 billion parameter model built upon the Qwen3-8B-Base architecture, developed by TMLR-Group-HF. Its key differentiator is the application of the novel Co-rewarding self-supervised reinforcement learning (RL) framework during fine-tuning on the MATH training set. This method addresses the limitations of traditional self-rewarding techniques by introducing complementary supervision, thereby improving training stability and preventing reward hacking.
Key Capabilities
- Stable Self-supervised RL: Utilizes Co-rewarding, a framework designed to provide stable training by seeking complementary supervision from multiple views, mitigating the training collapse issue common in other self-rewarding methods.
- Enhanced Mathematical Reasoning: Specifically fine-tuned to elicit and improve reasoning abilities in complex mathematical tasks.
- Superior Benchmark Performance: Outperforms other self-rewarding baselines by an average of +3.31% on multiple mathematical reasoning benchmarks. Notably, it achieves a Pass@1 of 94.01% on GSM8K, surpassing RL with ground-truth labels in certain scenarios.
- Two Instantiations: The Co-rewarding framework is instantiated in two ways: Co-rewarding-I (data-side, using contrastive agreement across complementary views of a question) and Co-rewarding-II (model-side, using a slowly-updated reference teacher for self-distillation). This checkpoint is trained with Co-rewarding-I; a conceptual sketch of the data-side reward idea follows this list.
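The snippet below is a heavily simplified, illustrative sketch of the data-side agreement idea behind Co-rewarding-I: rollouts on the original question are rewarded when they agree with the consensus answer obtained from rollouts on a rephrased view of the same question, and vice versa. The function names, the majority-vote pseudo-labeling, and the 0/1 reward scheme are assumptions for illustration only, not the authors' implementation.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common final answer among a set of rollouts (ties broken arbitrarily)."""
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None

def co_rewarding_i_rewards(answers_original, answers_rephrased):
    """Illustrative pseudo-reward: each rollout on one view of a question is rewarded
    when its final answer matches the majority answer from the complementary view.
    This is a simplification of the 'contrastive agreement' idea; the actual
    Co-rewarding-I objective may differ."""
    pseudo_label_from_rephrased = majority_answer(answers_rephrased)
    pseudo_label_from_original = majority_answer(answers_original)
    rewards_original = [1.0 if a == pseudo_label_from_rephrased else 0.0 for a in answers_original]
    rewards_rephrased = [1.0 if a == pseudo_label_from_original else 0.0 for a in answers_rephrased]
    return rewards_original, rewards_rephrased

# Example: four rollouts per view of the same MATH problem
orig = ["42", "42", "7", "42"]
reph = ["42", "41", "42", "42"]
print(co_rewarding_i_rewards(orig, reph))
# ([1.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 1.0])
```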
Good For
- Applications requiring robust mathematical problem-solving and reasoning (a minimal loading sketch follows this list).
- Research into self-supervised reinforcement learning and its application to LLM reasoning.
- Tasks where high accuracy on quantitative reasoning benchmarks is critical.
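Below is a minimal sketch of loading this checkpoint for inference with the Hugging Face transformers library. The loading path (AutoModelForCausalLM / AutoTokenizer) and the prompt format are assumptions based on typical Qwen3-Base usage, not instructions from the model authors.

```python
# Minimal inference sketch (assumes the checkpoint loads with standard
# transformers AutoModelForCausalLM / AutoTokenizer, as Qwen3-based models typically do).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Prompt format is an assumption; base models often need an explicit reasoning instruction.
prompt = (
    "Question: What is the sum of all positive divisors of 28?\n"
    "Please reason step by step, and put your final answer within \\boxed{}.\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```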