TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k

Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Oct 3, 2025 · License: MIT · Architecture: Transformer

TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k is an 8-billion-parameter Qwen3-based language model trained on the DAPO-14k dataset. It is a product of research into stable self-supervised reinforcement learning for eliciting reasoning in large language models. Its main distinction is its training methodology, which uses a majority-voting signal to strengthen reasoning. It is intended for tasks requiring advanced reasoning, as explored in the associated research paper.


Model Overview

This model, Majority-Voting: Qwen3-8B-Base-DAPO14k, is an 8 billion parameter language model built upon the Qwen3 architecture. It has been specifically trained using the DAPO-14k dataset as part of research into advanced reasoning capabilities in large language models.
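Since this is a standard Qwen3-based causal language model, it should load with the usual Hugging Face Transformers API. The sketch below is a minimal, hedged example: the plain question/answer prompt format is an assumption (the checkpoint builds on a base model, not a chat-tuned one), and generation settings are illustrative only.

```python
# Minimal sketch: loading and prompting the checkpoint with Transformers.
# The prompt template is an assumption for a base (non-chat) model.
MODEL_ID = "TMLR-Group-HF/Majority-Voting-Qwen3-8B-Base-DAPO14k"

def build_prompt(question: str) -> str:
    # Simple completion-style prompt; adjust to your own evaluation setup.
    return f"Question: {question}\nAnswer:"

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(build_prompt("What is 17 * 24?"), return_tensors="pt")
    inputs = inputs.to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```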

Key Capabilities

  • Enhanced Reasoning: The model's training incorporates a majority-voting mechanism, a technique explored in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models" (arXiv:2508.00410). This method aims to improve the model's ability to perform complex reasoning tasks.
  • Self-supervised RL: It leverages stable self-supervised reinforcement learning, a novel approach to training that allows the model to learn and refine its reasoning processes without extensive human supervision.
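At inference time, the majority-voting idea reduces to self-consistency: sample several completions, extract each final answer, and keep the most frequent one. The snippet below is a minimal sketch of that aggregation step only; the paper's co-rewarding training signal is more involved, and answer extraction from a full reasoning trace is left to the caller.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled completions.

    `answers` is assumed to already hold the extracted final answers
    from N independently sampled reasoning traces.
    """
    counts = Counter(a.strip() for a in answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# e.g. five extracted final answers for "What is 17 * 24?"
print(majority_vote(["408", "408", "401", "408", "418"]))  # -> 408
```

Ties fall back to `Counter.most_common` ordering (first-seen answer wins); a production harness might instead re-sample or break ties by model log-probability.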

Good For

  • Research in LLM Reasoning: Ideal for researchers and developers interested in exploring and applying advanced reasoning techniques in large language models.
  • Applications Requiring Logical Inference: Suitable for use cases where robust logical inference and problem-solving are critical, benefiting from its specialized training methodology.
  • Experimentation with Co-rewarding: Provides a practical implementation for those looking to experiment with the 'Co-rewarding' framework for eliciting reasoning.