TMLR-Group-HF/GT-Qwen3-4B-Base-MATH

Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Aug 5, 2025 | License: MIT | Architecture: Transformer | Open Weights

TMLR-Group-HF/GT-Qwen3-4B-Base-MATH is a 4-billion-parameter Qwen3-Base model developed by TMLR-Group-HF and trained with the Ground Truth (GRPO) method on the MATH dataset. The model is optimized for mathematical reasoning tasks and leverages the Co-rewarding self-supervised reinforcement learning framework. That framework aims to strengthen reasoning in large language models by addressing the stability and scaling challenges of self-rewarding methods.


Model Overview

TMLR-Group-HF/GT-Qwen3-4B-Base-MATH is a 4-billion-parameter Qwen3-Base model developed by TMLR-Group-HF and fine-tuned for mathematical reasoning. It is trained with the Ground Truth (GRPO) method on the MATH dataset, as described in the research paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".
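To make the training signal concrete, here is a minimal sketch of how GRPO-style training can turn ground-truth answers into per-sample advantages. It assumes a binary exact-match reward and the commonly described group normalization (reward minus group mean, divided by group standard deviation). The function names, the exact-match rule, and the use of the population standard deviation are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def ground_truth_reward(predicted: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 on an exact match with the reference answer."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled final answers to one question whose reference answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [ground_truth_reward(a, "42") for a in answers]  # [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)  # roughly [1, -1, 1, -1]
```

When every completion in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.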

Key Capabilities

  • Enhanced Mathematical Reasoning: Optimized for solving mathematical problems through GRPO training on the MATH dataset.
  • Co-rewarding Framework: Incorporates a novel self-supervised reinforcement learning (RL) framework designed to improve training stability and elicit reasoning in LLMs.
  • Stable Self-supervised Learning: Addresses common issues like scaling dilemmas and training collapse found in other self-rewarding methods.
  • Two Instantiations: The Co-rewarding framework is implemented via Co-rewarding-I (data-side, using contrastive agreement) and Co-rewarding-II (model-side, using self-distillation with a slowly-updated reference teacher).
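The two instantiations can be illustrated with a toy sketch. It assumes that the data-side agreement is measured by majority-voting final answers across two views of a question (the original and a rephrased version), and that the "slowly-updated reference teacher" is maintained as an exponential moving average (EMA) of the student weights. Both choices, and all names below, are assumptions made for illustration; the paper defines the actual procedures.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def cross_view_rewards(original_answers, rephrased_answers):
    """Data-side sketch: score each view against the pseudo-label voted by the other view."""
    label_from_original = majority_vote(original_answers)
    label_from_rephrased = majority_vote(rephrased_answers)
    rewards_original = [float(a == label_from_rephrased) for a in original_answers]
    rewards_rephrased = [float(a == label_from_original) for a in rephrased_answers]
    return rewards_original, rewards_rephrased


def ema_update(teacher, student, decay=0.99):
    """Model-side sketch: move teacher weights a small step toward the student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Sampled final answers for the original and the rephrased question.
orig_rewards, reph_rewards = cross_view_rewards(
    ["12", "12", "15", "12"],
    ["12", "12", "9", "12"],
)

# Toy two-parameter "models"; a decay close to 1.0 makes the teacher change slowly.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

Because each view is scored against a label that the other view produced, a completion cannot earn reward merely by agreeing with its own group, which is one intuition for why such schemes may resist training collapse.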

Good For

  • Mathematical Problem Solving: Ideal for applications requiring strong mathematical reasoning abilities.
  • Research in RL for LLMs: Serves as a practical example of the Co-rewarding framework for stable self-supervised learning.
  • Benchmarking Reasoning Tasks: Can be used to evaluate and compare performance on complex reasoning benchmarks.
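For problem solving or benchmarking, a minimal sketch of querying the model through the Hugging Face transformers library might look like the following. The prompt template, the \boxed{} answer convention, and the helper names are assumptions rather than a documented interface for this model; device_map="auto" additionally requires the accelerate package. The heavy imports sit inside generate_answer so the helper functions can be used without downloading the weights.

```python
MODEL_ID = "TMLR-Group-HF/GT-Qwen3-4B-Base-MATH"


def build_prompt(question: str) -> str:
    """Hypothetical prompt template asking for a boxed final answer."""
    return (
        f"{question}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model in BF16 and greedily decode one completion."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    completion = generate_answer("What is the sum of the first 10 positive integers?")
    print(extract_boxed(completion))
```

For benchmarking, the extracted string can be compared against each reference answer. Exact string matching is a simplification, since equivalent answers such as 0.5 and \frac{1}{2} would need normalization first.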