Model Overview
TMLR-Group-HF/GT-Qwen3-4B-Base-MATH is a 4-billion-parameter model based on Qwen3-4B-Base, developed by TMLR-Group-HF and fine-tuned for mathematical reasoning. It is trained with GRPO using ground-truth (GT) reward signals on the MATH dataset, as detailed in the research paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".
Key Capabilities
- Enhanced Mathematical Reasoning: Optimized for solving mathematical problems through specialized training.
- Co-rewarding Framework: Incorporates a novel self-supervised reinforcement learning (RL) framework designed to improve training stability and elicit reasoning in LLMs.
- Stable Self-supervised Learning: Addresses issues such as the scaling dilemma and training collapse observed in other self-rewarding methods.
- Two Instantiations: The Co-rewarding framework is implemented via Co-rewarding-I (data-side, using contrastive agreement) and Co-rewarding-II (model-side, using self-distillation with a slowly-updated reference teacher).
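The two instantiations above can be sketched in a few lines. Everything in this snippet is an illustrative assumption, not the authors' implementation: Co-rewarding-I is modeled as a majority-vote agreement reward across two phrasings of a question, and Co-rewarding-II's slowly-updated reference teacher is modeled as an exponential moving average (EMA) of the student's weights.

```python
# Illustrative sketches of the two Co-rewarding instantiations. All names
# and update rules here are assumptions for exposition, not the paper's code.
from collections import Counter

# Co-rewarding-I (data-side): score rollouts for one phrasing of a question
# against a pseudo-label obtained by majority vote over rollouts for a
# rephrased view of the same question.
def agreement_reward(answers_view_a, answers_view_b):
    pseudo_label = Counter(answers_view_b).most_common(1)[0][0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers_view_a]

# Co-rewarding-II (model-side): keep a reference teacher that trails the
# student, modeled here as an exponential moving average of its weights.
def ema_update(teacher_params, student_params, tau=0.9):
    # tau near 1.0 makes the teacher update slowly, which is what
    # stabilizes the self-distillation signal.
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy demonstrations with final-answer strings and scalar "weights":
rewards = agreement_reward(["4", "5", "4"], ["4", "4", "6"])
print(rewards)  # [1.0, 0.0, 1.0]

teacher, student = [0.0, 0.0], [1.0, 2.0]
for _ in range(10):
    teacher = ema_update(teacher, student)
print(teacher)  # moved most of the way toward the student, still lagging
```

The EMA keeps the teacher's targets nearly stationary between steps, which is one plausible reading of why the "slowly-updated reference teacher" avoids the collapse seen when a model rewards itself directly.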
Good For
- Mathematical Problem Solving: Well suited to applications that require strong mathematical reasoning.
- Research in RL for LLMs: Serves as a practical example of the Co-rewarding framework for stable self-supervised learning.
- Benchmarking Reasoning Tasks: Can be used to evaluate and compare performance on complex reasoning benchmarks.