Model Overview
TMLR-Group-HF/GT-Qwen3-4B-Base-MATH is a 4-billion-parameter model based on Qwen3-4B-Base, developed by TMLR-Group-HF and fine-tuned for mathematical reasoning. It is trained with GRPO using ground-truth (GT) reward signals on the MATH dataset, as detailed in the research paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".
Key Capabilities
- Enhanced Mathematical Reasoning: Optimized for solving mathematical problems through specialized training.
- Co-rewarding Framework: Incorporates a novel self-supervised reinforcement learning (RL) framework designed to improve training stability and elicit reasoning in LLMs.
- Stable Self-supervised Learning: Addresses issues such as the scaling dilemma and training collapse observed in other self-rewarding methods.
- Two Instantiations: The Co-rewarding framework is implemented via Co-rewarding-I (data-side, using contrastive agreement) and Co-rewarding-II (model-side, using self-distillation with a slowly-updated reference teacher).
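The two instantiations above can be sketched in a few lines. Everything in this snippet is an illustrative assumption, not the authors' implementation: Co-rewarding-I is modeled as a majority-vote agreement reward across two phrasings of a question, and Co-rewarding-II's slowly-updated reference teacher is modeled as an exponential moving average (EMA) of the student's weights.

```python
# Illustrative sketches of the two Co-rewarding instantiations. All names
# and update rules here are assumptions for exposition, not the paper's code.
from collections import Counter

# Co-rewarding-I (data-side): score rollouts for one phrasing of a question
# against a pseudo-label obtained by majority vote over rollouts for a
# rephrased view of the same question.
def agreement_reward(answers_view_a, answers_view_b):
    pseudo_label = Counter(answers_view_b).most_common(1)[0][0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers_view_a]

# Co-rewarding-II (model-side): keep a reference teacher that trails the
# student, modeled here as an exponential moving average of its weights.
def ema_update(teacher_params, student_params, tau=0.9):
    # tau near 1.0 makes the teacher update slowly, which is what
    # stabilizes the self-distillation signal.
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy demonstrations with final-answer strings and scalar "weights":
rewards = agreement_reward(["4", "5", "4"], ["4", "4", "6"])
print(rewards)  # [1.0, 0.0, 1.0]

teacher, student = [0.0, 0.0], [1.0, 2.0]
for _ in range(10):
    teacher = ema_update(teacher, student)
print(teacher)  # moved most of the way toward the student, still lagging
```

The EMA keeps the teacher's targets nearly stationary between steps, which is one plausible reading of why the "slowly-updated reference teacher" avoids the collapse seen when a model rewards itself directly.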
Good For
- Mathematical Problem Solving: Well suited to applications that require strong mathematical reasoning.
- Research in RL for LLMs: Serves as a practical example of the Co-rewarding framework for stable self-supervised learning.
- Benchmarking Reasoning Tasks: Can be used to evaluate and compare performance on complex reasoning benchmarks.