Name: TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: TMLR-Group-HF

Model Overview

TMLR-Group-HF/Co-rewarding-I-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter model built on the Qwen3-8B-Base architecture. It was released by TMLR-Group-HF and fine-tuned on the DAPO-14k dataset. Its core innovation is the training methodology, which uses the Co-rewarding-I framework.

Key Capabilities

Enhanced Reasoning: The model is trained specifically to improve the reasoning ability of its base large language model (LLM).
Stable Self-supervised RL: It is trained with Co-rewarding-I, the data-side instantiation of a self-supervised reinforcement learning (RL) framework. This approach aims to give more stable training than traditional single-view self-rewarding methods.
Mitigation of Training Issues: Co-rewarding-I addresses common problems like training collapse and reward hacking by deriving reward signals from contrastive agreement across semantically analogous questions.
Mathematical Reasoning: The framework is particularly effective for complex challenges such as mathematical reasoning.
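
The idea of deriving rewards from agreement across analogous questions can be illustrated with a toy sketch. This summary does not spell out the exact reward rule, so the code below makes an assumption: each view (the original question and a rephrased version of it) is scored against the majority answer of the other view. The function names are hypothetical and are not taken from the Co-rewarding codebase.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent final answer among sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]


def cross_view_rewards(original_answers, rephrased_answers):
    """Score each rollout against the pseudo-label from the other view.

    Assumption for illustration: the pseudo-label for the original
    question is the majority answer on the rephrased question, and
    vice versa, so neither view grades itself.
    """
    label_for_original = majority_answer(rephrased_answers)
    label_for_rephrased = majority_answer(original_answers)
    rewards_original = [
        1.0 if a == label_for_original else 0.0 for a in original_answers
    ]
    rewards_rephrased = [
        1.0 if a == label_for_rephrased else 0.0 for a in rephrased_answers
    ]
    return rewards_original, rewards_rephrased


# Final answers extracted from four sampled rollouts per view (made-up data).
original = ["12", "12", "15", "12"]
rephrased = ["12", "15", "12", "12"]
r_orig, r_reph = cross_view_rewards(original, rephrased)
print(r_orig, r_reph)
```

Because the label for one view comes from the other, a rollout cannot raise its own reward simply by repeating an answer that dominates its own samples, which is the intuition behind the stability claim compared with single-view self-rewarding.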

What Makes This Model Different?

This model stands out for its Co-rewarding-I training framework, which is detailed in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" [https://huggingface.co/papers/2508.00410]. Its training relies on a self-supervised RL approach to elicit and stabilize reasoning capabilities, rather than solely on supervised fine-tuning or simpler RL methods. This makes it particularly suitable for applications that require reliable and consistent reasoning performance.
