CoReward-Qwen3-8B-Base: Enhanced Mathematical Reasoning
CoReward-Qwen3-8B-Base is an 8-billion-parameter model built on the Qwen3-8B-Base architecture, developed by TMLR-Group-HF. Its key differentiator is the novel Co-rewarding self-supervised reinforcement learning (RL) framework applied during fine-tuning on the MATH training set. Co-rewarding addresses the limitations of traditional self-rewarding techniques by introducing complementary supervision, which improves training stability and prevents reward hacking.
Key Capabilities
- Stable Self-supervised RL: Utilizes Co-rewarding, a framework designed to provide stable training by seeking complementary supervision from multiple views, mitigating the training collapse issue common in other self-rewarding methods.
- Enhanced Mathematical Reasoning: Specifically fine-tuned to elicit and improve reasoning abilities in complex mathematical tasks.
- Superior Benchmark Performance: Outperforms other self-rewarding baselines by an average of +3.31% across multiple mathematical reasoning benchmarks. Notably, it achieves 94.01% Pass@1 on GSM8K, surpassing even RL with ground-truth labels in certain scenarios.
- Two Instantiations: The Co-rewarding framework comes in two variants: Co-rewarding-I (data-side), which derives rewards from contrastive agreement between answers to different views of the same question, and Co-rewarding-II (model-side), which uses a slowly-updated reference teacher for self-distillation.
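The two instantiations above can be sketched in a few lines. This is an illustrative simplification based only on the description in this card, not the authors' implementation: `co_rewarding_i` rewards rollouts on one view of a question for agreeing with the majority answer from the other view, and `ema_update` shows the kind of slowly-updated teacher that Co-rewarding-II distills from. All function and variable names here are hypothetical.

```python
from collections import Counter

def majority_answer(answers):
    """Most common final answer among a group of sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]

def co_rewarding_i(answers_original, answers_rephrased):
    """Data-side sketch: rollouts on the original question are scored
    against the majority answer from a rephrased view of the same
    question, and vice versa (contrastive agreement across views)."""
    pseudo_for_original = majority_answer(answers_rephrased)
    pseudo_for_rephrased = majority_answer(answers_original)
    rewards_original = [1.0 if a == pseudo_for_original else 0.0
                        for a in answers_original]
    rewards_rephrased = [1.0 if a == pseudo_for_rephrased else 0.0
                         for a in answers_rephrased]
    return rewards_original, rewards_rephrased

def ema_update(teacher_params, student_params, tau=0.99):
    """Model-side sketch: the reference teacher is a slow exponential
    moving average of the student; its outputs serve as pseudo-labels."""
    return {name: tau * teacher_params[name] + (1 - tau) * student_params[name]
            for name in teacher_params}
```

In this sketch the cross-view pseudo-labels mean neither view scores its own rollouts against itself, which is the kind of complementary supervision the card credits with avoiding training collapse.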
Good For
- Applications requiring robust mathematical problem-solving and reasoning.
- Research into self-supervised reinforcement learning and its application to LLM reasoning.
- Tasks where high accuracy on quantitative reasoning benchmarks is critical.