jaygala24/Qwen2.5-1.5B-RLOO-math-reasoning

TEXT GENERATION

  • Model size: 1.5B parameters
  • Quantization: BF16
  • Context length: 32k
  • Concurrency cost: 1
  • Published: Apr 23, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)

jaygala24/Qwen2.5-1.5B-RLOO-math-reasoning is a 1.5-billion-parameter causal language model fine-tuned from Qwen2.5-1.5B. It specializes in mathematical reasoning, having been trained with the RLOO (REINFORCE Leave-One-Out) algorithm without a KL penalty. The model performs strongly on benchmarks such as GSM8K and MATH-500, making it suitable for applications that require precise mathematical problem solving. It was trained with a context length of 8192 tokens.


Model Overview

This model, jaygala24/Qwen2.5-1.5B-RLOO-math-reasoning, is a 1.5-billion-parameter language model derived from Qwen2.5-1.5B. It is distinguished by its fine-tuning with the RLOO (REINFORCE Leave-One-Out) algorithm without a KL penalty, optimized specifically for mathematical reasoning tasks. Training used the PipelineRL framework.
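Since this is a standard causal language model derived from Qwen2.5-1.5B, it should load through the Hugging Face transformers API. The sketch below is illustrative: the prompt wording, helper name, and generation settings are assumptions, since the model card does not document a required prompt template.

```python
import os

MODEL_ID = "jaygala24/Qwen2.5-1.5B-RLOO-math-reasoning"

def build_prompt(question: str) -> str:
    """Wrap a math question in a plain instruction prompt.

    This format is an assumption; the model card does not specify
    a required template.
    """
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question}\nSolution:"
    )

# The model download is heavy, so the inference demo is gated behind
# an environment flag.
if os.environ.get("RUN_MODEL_DEMO"):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")
    inputs = tokenizer(build_prompt("What is 12 * 7?"), return_tensors="pt")
    # Sampling at temperature 1.0 mirrors the evaluation setup on the card.
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

BF16 loading matches the quantization listed in the model metadata; adjust the dtype if your hardware lacks bfloat16 support.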

Key Capabilities & Performance

The model excels in mathematical problem-solving, as evidenced by its evaluation results on standard benchmarks:

  • GSM8K (test): Achieved 78.44% pass@1 and 96.06% pass@32.
  • MATH-500: Achieved 60.14% pass@1 and 89.80% pass@32.
  • Overall: Demonstrated 73.41% pass@1 and 94.34% pass@32 across 1819 problems.
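For reference, pass@k over n generated samples is conventionally computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where c is the number of correct samples for a problem. The sketch below assumes this standard formula; the actual evaluation harness used for these numbers is not specified on the card.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, of which
    c are correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 32 samples per problem, as in the evaluation above:
print(pass_at_k(32, 8, 1))   # pass@1 equals the per-sample accuracy (0.25)
print(pass_at_k(32, 8, 32))  # any correct sample makes pass@32 equal 1.0
```

Per-benchmark pass@k is then the mean of this estimate over all problems.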

These results are based on generating 32 samples per problem with a temperature of 1.0. The RLOO algorithm uses a leave-one-out mean reward as the baseline for its REINFORCE-style policy loss, contributing to its specialized reasoning abilities.
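The leave-one-out baseline can be sketched concretely: for k sampled completions per prompt with rewards r_1..r_k, each sample's advantage is its reward minus the mean reward of the other k-1 samples. The function below is a pure-Python illustration of that computation; the name and interface are not taken from the PipelineRL codebase.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """RLOO advantage for each of k sampled completions of one prompt:
    reward minus the leave-one-out mean of the other k-1 rewards.
    Illustrative sketch, not the PipelineRL API."""
    k = len(rewards)
    total = sum(rewards)
    # Baseline for sample i is mean(rewards without i) = (total - r_i) / (k - 1).
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: four samples, only the first earned reward 1.0.
# The correct sample gets a positive advantage, the rest negative.
print(rloo_advantages([1.0, 0.0, 0.0, 0.0]))
```

These advantages weight the REINFORCE-style policy loss (roughly, minus advantage times the completion's log-probability); a useful property of the leave-one-out baseline is that the advantages for one prompt always sum to zero.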

Good For

  • Applications requiring robust mathematical reasoning.
  • Tasks involving step-by-step problem-solving in mathematics.
  • Developers looking for a compact model (1.5B parameters) with strong math capabilities.