TMLR-Group-HF/GT-Qwen3-8B-Base-MATH

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Aug 5, 2025 | License: mit | Architecture: Transformer | Open Weights

TMLR-Group-HF/GT-Qwen3-8B-Base-MATH is an 8-billion-parameter Qwen3-Base variant developed by TMLR-Group-HF. It was trained with the GRPO Ground Truth method on the MATH training set, as described in the Co-rewarding paper. The training targets reasoning ability, so the model is best suited to mathematical and other multi-step reasoning tasks. Its 32,768-token context length leaves room for long problem statements.

Model Overview

TMLR-Group-HF/GT-Qwen3-8B-Base-MATH is an 8-billion-parameter model based on the Qwen3-Base architecture. TMLR-Group-HF trained it with the GRPO Ground Truth method on the MATH training set. The training setup is described in the research paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" (arXiv:2508.00410). That paper's subject, Co-rewarding, is a stable self-supervised reinforcement learning approach for improving reasoning in large language models; this checkpoint is the one trained with the GRPO Ground Truth method described there.
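To make the training signal more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built around, paired with a binary reward computed against a reference answer. It assumes that "Ground Truth" means rewards are derived from the dataset's reference answers; the function names, the exact-match comparison, and the example answers are hypothetical and are not taken from the authors' training code.

```python
import statistics


def ground_truth_reward(predicted: str, reference: str) -> float:
    """Binary reward: 1.0 when the predicted final answer matches the reference.

    Exact string match after stripping whitespace is an assumption made for
    illustration; real math graders usually normalize expressions first.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical sampled answers to one problem whose reference answer is "42".
sampled_answers = ["42", "41", " 42 ", "7"]
rewards = [ground_truth_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(sampled_answers, rewards, advantages):
    print(f"{answer!r}: reward={reward}, advantage={adv:+.3f}")
```

Correct samples receive a positive advantage and incorrect ones a negative advantage. When every sample in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.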

Key Capabilities

  • Enhanced Reasoning: Trained for multi-step reasoning tasks, particularly in mathematical domains.
  • Specialized Training: Trained with the GRPO Ground Truth method on the MATH training set.
  • Self-supervised RL Research: Released as part of the Co-rewarding work on stable self-supervised reinforcement learning.
  • Context Length: Supports a context window of 32,768 tokens.
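In practice, the context window is shared between the prompt and the generated tokens. The sketch below shows one way to work out how much room is left for generation; the helper name is hypothetical, and the only figure taken from this card is the 32,768-token limit.

```python
CONTEXT_LENGTH = 32768  # context length stated on this model card


def max_new_tokens_for(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after a prompt of the given size."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Prompt and completion share one window, so the remainder is the budget.
    return max(context_length - prompt_tokens, 0)


# A 1,000-token problem statement leaves most of the window for the solution.
remaining = max_new_tokens_for(1000)
print(remaining)
```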

Good For

  • Mathematical Problem Solving: Aimed at tasks that require logical and mathematical reasoning.
  • Research in Reasoning: Useful for researchers studying reasoning elicitation techniques in LLMs.
  • Complex Task Handling: Suited to applications that need a long context window together with multi-step reasoning.
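For the use cases above, a minimal loading sketch with the Hugging Face Transformers library could look like the following. It assumes the repository ships standard Transformers-format weights and that `transformers`, `torch`, and `accelerate` are installed. The prompt layout in `build_prompt` is hypothetical, since this card does not specify a prompt template, and the helper names are for illustration only.

```python
MODEL_ID = "TMLR-Group-HF/GT-Qwen3-8B-Base-MATH"


def build_prompt(problem: str) -> str:
    """Hypothetical plain-text prompt; the card does not specify a template."""
    return f"Problem: {problem.strip()}\nSolution:"


def generate_solution(problem: str, max_new_tokens: int = 512) -> str:
    """Load the model and generate a completion for one math problem."""
    # Imported lazily so build_prompt can be used without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # requires the accelerate package
    )
    inputs = tokenizer(build_prompt(problem), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the 8B checkpoint on first use):
# print(generate_solution("What is the sum of the first 10 positive integers?"))
```

The example call is left commented out because it downloads the full checkpoint. Because this is a base model rather than a chat-tuned one, a plain completion-style prompt is used here instead of a chat template.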