jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning
Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Context Length: 32k · Published: Apr 23, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights
jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning is a 0.5-billion-parameter Qwen2.5-based causal language model fine-tuned by jaygala24 using RLOO (REINFORCE Leave-One-Out) without a KL penalty. The model is optimized specifically for mathematical reasoning, with strong results on benchmarks such as GSM8K and MATH-500, and its 32K context length gives it ample room for step-by-step solutions.
Model Overview
This model, jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning, is a fine-tuned version of the Qwen2.5-0.5B base model specialized for mathematical reasoning. It was trained with RLOO (REINFORCE Leave-One-Out) without a KL penalty.
Key Capabilities & Training
- Mathematical Reasoning: The model is explicitly trained and optimized for solving mathematical problems, as evidenced by its evaluation on the `gsm8k` and `math` datasets.
- RLOO Algorithm: It leverages the RLOO algorithm, which uses the leave-one-out mean reward as a baseline for the policy loss, sharpening the credit assigned to each sampled solution (a minimal sketch of this baseline follows this list).
- Performance: Achieves strong pass@k scores on mathematical benchmarks (the pass@k metric itself is sketched after this list):
- GSM8K (test): 89.69% pass@32
- MATH-500: 75.00% pass@32
- Overall: 85.65% pass@32 across all 1,819 problems (1,319 from the GSM8K test set plus 500 from MATH-500).
- Training Framework: Trained with PipelineRL at a sequence length of 8192 for 1,500 steps.
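The leave-one-out baseline at the core of RLOO is compact to express in code. Below is a minimal, hypothetical PyTorch sketch, not the PipelineRL implementation: the function names, the binary correctness rewards, and the assumption of precomputed per-completion log-probabilities are all illustrative. For each of k completions sampled per prompt, the baseline is the mean reward of the other k - 1 completions, and the resulting advantage weights a plain REINFORCE loss with no KL term.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for k sampled completions per prompt.

    rewards: (batch, k) tensor, one scalar reward per completion.
    The baseline for completion i is the mean reward of the other
    k - 1 completions for the same prompt.
    """
    k = rewards.size(-1)
    baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE loss with a leave-one-out baseline and no KL penalty.

    logprobs: (batch, k) summed log-probabilities of each completion
    under the current policy.
    """
    advantages = rloo_advantages(rewards).detach()  # baseline carries no gradient
    return -(advantages * logprobs).mean()

# Toy example: 2 prompts, k = 4 completions each, binary correctness rewards.
rewards = torch.tensor([[1., 0., 1., 1.],
                        [0., 0., 1., 0.]])
logprobs = torch.randn(2, 4, requires_grad=True)  # stand-in for real log-probs
rloo_loss(logprobs, rewards).backward()
```

Because the baseline is computed from the model's own samples, RLOO needs no learned value function, which is part of what makes it attractive for small models like this one.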
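For reading the pass@32 numbers above: a problem counts as solved if at least one of the 32 sampled solutions is correct. When an evaluator samples n completions and observes c correct ones, the standard unbiased estimator of Chen et al. (2021) is commonly used; the snippet below illustrates that formula and is not the evaluation harness used for this model.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: completions sampled per problem
    c: completions that solved the problem
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 32 this reduces to: did any of the 32 samples succeed?
assert pass_at_k(32, 1, 32) == 1.0
assert pass_at_k(32, 0, 32) == 0.0
```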
Good For
- Applications requiring accurate step-by-step mathematical problem-solving.
- Integration into systems where mathematical reasoning is a core component.
- Researchers exploring RL-based fine-tuning methods for specialized tasks, particularly RLOO without KL penalty.