TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-MATH
Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Aug 4, 2025 | License: mit | Architecture: Transformer | Open Weights | Cold

The CoReward-Qwen3-8B-Base model, developed by TMLR-Group-HF, is an 8-billion-parameter model fine-tuned from Qwen3-8B-Base with Co-rewarding, a self-supervised reinforcement learning method. It is optimized for mathematical reasoning and is reported to train stably and perform strongly on several mathematical benchmarks. In some cases it surpasses RL trained with ground-truth labels; for example, it reaches 94.01% Pass@1 on GSM8K.
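For readers unfamiliar with the metric, Pass@1 is the fraction of problems for which a single sampled answer is correct. The sketch below illustrates that computation under simple assumptions: the helper names `normalize_answer` and `pass_at_1` are hypothetical, the final answers are assumed to have already been extracted from the model output as strings, and the numeric normalization is a simplification. It is not the authors' evaluation harness and does not reproduce the reported score.

```python
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Hypothetical helper: canonicalize a final answer string.

    Removes surrounding whitespace, thousands separators, and a trailing
    period, then compares numerically when the string parses as a number.
    """
    cleaned = text.strip().replace(",", "").rstrip(".")
    try:
        # Fraction parses integers, decimals, and "a/b", so "18.0" == "18".
        return str(Fraction(cleaned))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to plain string comparison.
        return cleaned


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose single prediction matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Toy data, not GSM8K results: two of three answers match.
    preds = ["18", "3", "1,200."]
    refs = ["18.0", "4", "1200"]
    print(f"Pass@1 = {pass_at_1(preds, refs):.2%}")
```

In practice, benchmark harnesses differ in how they extract and normalize the final answer, so scores computed with a sketch like this may not be directly comparable to published numbers.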
