jaygala24/Qwen3-4B-RLOO-math-reasoning
The jaygala24/Qwen3-4B-RLOO-math-reasoning model is a 4 billion parameter Qwen3-based causal language model fine-tuned specifically for mathematical reasoning tasks. Utilizing RLOO (REINFORCE Leave-One-Out) without KL penalty, it demonstrates strong performance on benchmarks like GSM8K and MATH-500. With a 32768-token context length, this model is optimized for accurate step-by-step problem-solving in mathematics.
Loading preview...
Model Overview
This model, jaygala24/Qwen3-4B-RLOO-math-reasoning, is a 4 billion parameter variant of the Qwen3-4B base model, specifically fine-tuned for enhanced mathematical reasoning capabilities. It leverages a unique Reinforcement Learning approach called RLOO (REINFORCE Leave-One-Out), which uses a leave-one-out mean reward as the advantage baseline and operates without a KL penalty, distinguishing its training methodology from many other RLHF models.
Key Capabilities & Training
- Mathematical Reasoning: The model is explicitly trained on
gsm8k_trainandmath_traindatasets, focusing on arithmetic and advanced mathematical problems. - RLOO Algorithm: Employs a REINFORCE-style policy loss with a group-structured RLOO algorithm, where each response's advantage is calculated against the mean of other responses in its group.
- Performance: Achieves high pass@k scores on mathematical benchmarks:
- GSM8K (test): 90.08% pass@1, 97.73% pass@32
- MATH-500: 79.19% pass@1, 96.00% pass@32
- Overall: 87.09% pass@1, 97.25% pass@32
- Context Length: Supports a substantial context window of 32768 tokens, beneficial for complex multi-step problems.
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Accurate mathematical problem-solving.
- Step-by-step reasoning in quantitative tasks.
- Integration into systems where robust mathematical capabilities are critical.