CoReward-Qwen3-8B-Base: Enhanced Mathematical Reasoning
CoReward-Qwen3-8B-Base is an 8-billion-parameter model built on the Qwen3-8B-Base architecture, developed by TMLR-Group-HF. Its key differentiator is the novel Co-rewarding self-supervised reinforcement learning (RL) framework applied during fine-tuning on the MATH training set. Co-rewarding addresses the limitations of traditional self-rewarding techniques by introducing complementary supervision, which improves training stability and prevents reward hacking.
Key Capabilities
- Stable Self-supervised RL: Utilizes Co-rewarding, a framework designed to provide stable training by seeking complementary supervision from multiple views, mitigating the training collapse issue common in other self-rewarding methods.
- Enhanced Mathematical Reasoning: Specifically fine-tuned to elicit and improve reasoning abilities in complex mathematical tasks.
- Superior Benchmark Performance: Outperforms other self-rewarding baselines by an average of +3.31% across multiple mathematical reasoning benchmarks. Notably, it achieves 94.01% Pass@1 on GSM8K, surpassing even RL with ground-truth labels in certain scenarios.
- Two Instantiations: The Co-rewarding framework comes in two variants: Co-rewarding-I (data-side), which derives rewards from contrastive agreement between answers to different views of the same question, and Co-rewarding-II (model-side), which uses a slowly-updated reference teacher for self-distillation.
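The two instantiations above can be sketched in a few lines. This is an illustrative simplification based only on the description in this card, not the authors' implementation: `co_rewarding_i` rewards rollouts on one view of a question for agreeing with the majority answer from the other view, and `ema_update` shows the kind of slowly-updated teacher that Co-rewarding-II distills from. All function and variable names here are hypothetical.

```python
from collections import Counter

def majority_answer(answers):
    """Most common final answer among a group of sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]

def co_rewarding_i(answers_original, answers_rephrased):
    """Data-side sketch: rollouts on the original question are scored
    against the majority answer from a rephrased view of the same
    question, and vice versa (contrastive agreement across views)."""
    pseudo_for_original = majority_answer(answers_rephrased)
    pseudo_for_rephrased = majority_answer(answers_original)
    rewards_original = [1.0 if a == pseudo_for_original else 0.0
                        for a in answers_original]
    rewards_rephrased = [1.0 if a == pseudo_for_rephrased else 0.0
                         for a in answers_rephrased]
    return rewards_original, rewards_rephrased

def ema_update(teacher_params, student_params, tau=0.99):
    """Model-side sketch: the reference teacher is a slow exponential
    moving average of the student; its outputs serve as pseudo-labels."""
    return {name: tau * teacher_params[name] + (1 - tau) * student_params[name]
            for name in teacher_params}
```

In this sketch the cross-view pseudo-labels mean neither view scores its own rollouts against itself, which is the kind of complementary supervision the card credits with avoiding training collapse.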
Good For
- Applications requiring robust mathematical problem-solving and reasoning.
- Research into self-supervised reinforcement learning and its application to LLM reasoning.
- Tasks where high accuracy on quantitative reasoning benchmarks is critical.