jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning

TEXT GENERATION

  • Model Size: 0.5B
  • Quantization: BF16
  • Context Length: 32K
  • Concurrency Cost: 1
  • Published: Apr 23, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)

jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning is a 0.5 billion parameter Qwen2.5-based causal language model fine-tuned by jaygala24 using RLOO (REINFORCE Leave-One-Out) without KL penalty. This model is specifically optimized for mathematical reasoning tasks, demonstrating strong performance on benchmarks like GSM8K and MATH-500. With a 32K context length, it excels at generating step-by-step mathematical solutions.
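A minimal inference sketch with Hugging Face `transformers`, assuming the checkpoint is published in the standard format with a Qwen2.5 chat template; the prompt and generation settings below are illustrative, not taken from the model card.

```python
# Sketch: generate a step-by-step math solution with the fine-tuned model.
# Assumes the repo id below is available and includes a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Illustrative GSM8K-style word problem.
messages = [{"role": "user",
             "content": "Natalia sold clips to 48 friends in April, and half "
                        "as many in May. How many clips did she sell in total?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the model's solution).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding is shown for simplicity; the reported pass@32 numbers imply sampling multiple completions per problem in evaluation.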


Model Overview

This model, jaygala24/Qwen2.5-0.5B-RLOO-math-reasoning, is a specialized fine-tuned version of the Qwen2.5-0.5B base model. Its primary focus is mathematical reasoning, achieved by fine-tuning with RLOO (REINFORCE Leave-One-Out) without a KL penalty.

Key Capabilities & Training

  • Mathematical Reasoning: The model is explicitly trained and optimized for solving mathematical problems, as evidenced by its evaluation on gsm8k and math datasets.
  • RLOO Algorithm: It leverages the RLOO algorithm, which for each sampled completion uses the mean reward of the other completions (the leave-one-out mean) as a variance-reducing baseline in the policy-gradient loss.
  • Performance: Achieves notable pass@k scores on mathematical benchmarks:
    • GSM8K (test): 89.69% pass@32
    • MATH-500: 75.00% pass@32
    • Overall: 85.65% pass@32 across 1819 problems.
  • Training Framework: Developed using PipelineRL, with a sequence length of 8192 and trained for 1500 steps.
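The leave-one-out baseline mentioned above can be sketched in a few lines; this is an illustrative reconstruction of the general RLOO advantage computation, not code from the model's actual training run.

```python
# Sketch of the RLOO (leave-one-out) advantage computation for one prompt:
# sample k completions, score each with a scalar reward, and baseline each
# reward against the mean of the other k-1 rewards.
def rloo_advantages(rewards: list[float]) -> list[float]:
    """advantage_i = reward_i - mean(rewards of the other k-1 completions)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: 4 completions for one prompt with binary correctness rewards.
advs = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

These advantages then weight the log-probabilities of the sampled completions in the policy-gradient loss; by construction they sum to zero across the group, so no separate learned value baseline is needed.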

Good For

  • Applications requiring accurate step-by-step mathematical problem-solving.
  • Integration into systems where mathematical reasoning is a core component.
  • Researchers exploring RL-based fine-tuning methods for specialized tasks, particularly RLOO without KL penalty.