jaygala24/Qwen3-4B-RLOO-math-reasoning

Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Apr 26, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

jaygala24/Qwen3-4B-RLOO-math-reasoning is a 4-billion-parameter causal language model based on Qwen3 and fine-tuned for mathematical reasoning. It was trained with RLOO (REINFORCE Leave-One-Out) without a KL penalty and reaches 90.08% pass@1 on the GSM8K test set and 79.19% pass@1 on MATH-500. It supports a 32,768-token context length and is intended for step-by-step mathematical problem-solving.


Model Overview

This model, jaygala24/Qwen3-4B-RLOO-math-reasoning, is a fine-tune of the 4-billion-parameter Qwen3-4B base model aimed at mathematical reasoning. It is trained with a reinforcement learning method called RLOO (REINFORCE Leave-One-Out), which uses the leave-one-out mean reward of the other responses in a group as the advantage baseline. Training uses no KL penalty, which sets it apart from many RLHF setups.
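The leave-one-out baseline described above can be sketched in a few lines of Python. This is a minimal illustration and not the author's training code: the function names `rloo_advantages` and `reinforce_loss`, the binary correctness rewards, and the group of four responses are all assumptions made for the example.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for one group of sampled responses."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per group")
    total = sum(rewards)
    # Baseline for response i is the mean reward of the other k - 1 responses.
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_loss(logprobs: List[float], advantages: List[float]) -> float:
    """REINFORCE-style loss: negative mean of advantage-weighted log-probabilities.

    No KL term is added, in line with the description of the training setup.
    """
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(logprobs)


# Hypothetical group: four sampled answers, reward 1.0 if correct, else 0.0.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(group_rewards)
```

In this toy group the two correct answers receive an advantage of 2/3 and the two incorrect ones receive -2/3. The advantages within a group always sum to zero, so the baseline needs no learned value model.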

Key Capabilities & Training

  • Mathematical Reasoning: The model is explicitly trained on gsm8k_train and math_train datasets, focusing on arithmetic and advanced mathematical problems.
  • RLOO Algorithm: Uses a REINFORCE-style policy loss with group-structured RLOO: each response's advantage is its reward minus the mean reward of the other responses in its group.
  • Performance: Achieves high pass@k scores on mathematical benchmarks:
    • GSM8K (test): 90.08% pass@1, 97.73% pass@32
    • MATH-500: 79.19% pass@1, 96.00% pass@32
    • Overall: 87.09% pass@1, 97.25% pass@32
  • Context Length: Supports a substantial context window of 32768 tokens, beneficial for complex multi-step problems.
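The Overall figures are consistent with pooling the two benchmarks by problem count. The check below is a plausibility sketch under two assumptions not stated in the source: that the aggregate is a size-weighted average, and that the usual split sizes apply (1,319 problems in the GSM8K test set, 500 in MATH-500).

```python
from typing import Dict

# Assumed split sizes; the model card does not state them.
SIZES: Dict[str, int] = {"gsm8k_test": 1319, "math_500": 500}

# Per-benchmark scores as reported above, in percent.
PASS_AT_1: Dict[str, float] = {"gsm8k_test": 90.08, "math_500": 79.19}
PASS_AT_32: Dict[str, float] = {"gsm8k_test": 97.73, "math_500": 96.00}


def pooled_score(scores: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Average the per-benchmark scores, weighting each by its problem count."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in sizes) / total


overall_pass_at_1 = pooled_score(PASS_AT_1, SIZES)
overall_pass_at_32 = pooled_score(PASS_AT_32, SIZES)
```

Under these assumptions the pooled values round to 87.09 and 97.25, matching the reported Overall row.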

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Accurate mathematical problem-solving.
  • Step-by-step reasoning in quantitative tasks.
  • Integration into systems where robust mathematical capabilities are critical.
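For integration, a minimal inference sketch might look like the following. It assumes the weights can be downloaded from the Hugging Face Hub under the model ID above and loaded with the standard `transformers` causal-LM classes. The prompt wording in `build_prompt` and the `solve` helper are hypothetical, since the source does not specify a prompt format.

```python
MODEL_ID = "jaygala24/Qwen3-4B-RLOO-math-reasoning"


def build_prompt(problem: str) -> str:
    """Hypothetical prompt format; the model card does not prescribe one."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {problem}\n\nSolution:"
    )


def solve(problem: str, max_new_tokens: int = 1024) -> str:
    """Generate a step-by-step solution for a single problem."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is 17 * 24?")` would download the checkpoint on first use. Loading in BF16 and placing the model on a GPU are left out here and would need to be added for practical throughput.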