sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH
The sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH model is a 14 billion parameter Qwen3-based large language model fine-tuned using the Intuitor method on the MATH dataset. Developed by sunblaze-ucb, this model leverages Reinforcement Learning from Internal Feedback (RLIF) to learn reasoning skills using self-certainty as the sole reward, without external supervision. It is specifically optimized for mathematical problem-solving and reasoning tasks, offering a scalable approach for domains where labeled data is scarce. The model supports a context length of 32768 tokens.
Loading preview...
Overview
This model, sunblaze-ucb/Qwen3-14B-Intuitor-MATH-1EPOCH, is a 14 billion parameter variant of the Qwen3 architecture. It has been fine-tuned using the Intuitor method, a novel reinforcement learning approach, specifically on the MATH dataset. Intuitor operates on the principle of Reinforcement Learning from Internal Feedback (RLIF), which allows the model to learn and improve its reasoning capabilities by using its own internal confidence (self-certainty) as a reward signal, rather than relying on external rewards or labeled data.
Key Capabilities
- Mathematical Reasoning: Optimized for solving mathematical problems, as it was trained on the MATH dataset.
- Self-Supervised Learning: Utilizes RLIF, enabling learning from intrinsic signals without the need for expensive external supervision.
- Scalable Fine-tuning: Offers a domain-agnostic and scalable fine-tuning approach, particularly beneficial in scenarios where labeled data is limited or unavailable.
Good For
- Mathematical Problem Solving: Ideal for applications requiring robust mathematical reasoning.
- Research in RLIF: Demonstrates the effectiveness of learning from internal feedback for enhancing LLM capabilities.
- Environments with Limited Supervision: Suitable for tasks where obtaining external rewards or extensive labeled datasets is challenging.
For more technical details, refer to the associated paper: Learning to Reason without External Rewards and the GitHub Repository.