jaygala24/Qwen2.5-3B-RLOO-math-reasoning

Text Generation | Concurrency Cost: 1 | Model Size: 3.1B | Quant: BF16 | Ctx Length: 32k | Published: Apr 23, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

The jaygala24/Qwen2.5-3B-RLOO-math-reasoning model is a 3.1 billion parameter language model fine-tuned from Qwen2.5-3B for mathematical reasoning. It was trained with the RLOO (REINFORCE Leave-One-Out) algorithm, without a KL penalty, on the gsm8k_train and math_train datasets. Evaluated on GSM8K (test) and MATH-500, it achieves an overall pass@1 of 81.83% and pass@32 of 95.38% across 1819 problems.

Overview

This model, jaygala24/Qwen2.5-3B-RLOO-math-reasoning, is a 3.1 billion parameter language model derived from Qwen2.5-3B. It differs from the base model in its fine-tuning, which uses the RLOO (REINFORCE Leave-One-Out) algorithm without a KL penalty to improve mathematical reasoning.
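
As a rough illustration of the leave-one-out idea behind RLOO, the sketch below computes per-sample advantages for a group of completions sampled from one prompt and plugs them into a REINFORCE-style loss with no KL term. The function names, the group size of 4, the 0/1 correctness rewards, and the log-probability values are assumptions made for illustration; the model card does not describe the actual training code or reward function.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k completions of the same prompt.

    Each sample's baseline is the mean reward of the other k - 1 samples.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_loss(advantages: List[float], seq_logprobs: List[float]) -> float:
    """REINFORCE-style loss: negative mean of advantage * sequence log-prob.

    No KL penalty term is added. In a real trainer the advantages would be
    treated as constants (no gradient flows through them).
    """
    weighted = [a * lp for a, lp in zip(advantages, seq_logprobs)]
    return -sum(weighted) / len(weighted)


# Hypothetical group: 4 sampled answers to one problem, reward 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)

# Hypothetical summed token log-probabilities of each sampled answer.
seq_logprobs = [-12.3, -15.1, -9.8, -11.0]
loss = reinforce_loss(advantages, seq_logprobs)

print(advantages)
print(loss)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, and the advantages within a group sum to zero, so no separate learned value model is needed for the baseline.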

Key Capabilities & Training

  • Mathematical Reasoning: Fine-tuned to solve math problems using the gsm8k_train and math_train datasets.
  • RLOO Algorithm: Trained with a REINFORCE-style policy loss in which each sample's advantage baseline is the leave-one-out mean reward, that is, the mean reward of the other samples drawn for the same prompt.
  • Performance: Achieves notable results on math reasoning benchmarks:
    • GSM8K (test): 86.47% pass@1, 97.12% pass@32
    • MATH-500: 69.59% pass@1, 90.80% pass@32
    • Overall: 81.83% pass@1, 95.38% pass@32 across 1819 problems.
  • Training Sequence Length: Trained with a maximum sequence length of 8192 tokens. This is separate from the 32k context length given in the listing metadata.
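
The overall figures are consistent with a problem-count-weighted average of the two benchmark scores. The sketch below checks that arithmetic and also shows one common way to estimate pass@k from n samples per problem. Two assumptions are made here: the 1819 problems are split as 1319 (the standard GSM8K test set size) plus 500 (MATH-500), and pass@k is computed with the usual unbiased combinatorial estimator. The model card states neither, and the function and variable names are illustrative.

```python
from math import comb
from typing import Dict


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: budget being scored.
    Equals 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_overall(scores: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Average benchmark scores weighted by number of problems."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in sizes) / total


# Assumed split of the 1819 problems (not stated in the model card).
sizes = {"gsm8k_test": 1319, "math_500": 500}
pass1 = {"gsm8k_test": 86.47, "math_500": 69.59}
pass32 = {"gsm8k_test": 97.12, "math_500": 90.80}

overall_pass1 = round(weighted_overall(pass1, sizes), 2)
overall_pass32 = round(weighted_overall(pass32, sizes), 2)
print(overall_pass1, overall_pass32)  # prints: 81.83 95.38

# Example: 32 samples for one problem, 8 of them correct.
print(pass_at_k(n=32, c=8, k=1))  # prints: 0.25
```

Under these assumptions the weighted averages reproduce the reported 81.83% and 95.38%, which suggests the overall numbers are pooled across problems rather than a simple mean of the two benchmarks.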

Why this model is different

Unlike general-purpose LLMs, this model is fine-tuned with RLOO specifically for step-by-step mathematical problem-solving. Its training data and evaluation both target accuracy in arithmetic and algebraic reasoning, which makes it a candidate for applications that depend on correct mathematical output.