jaygala24/Qwen2.5-3B-RLOO-math-reasoning
The jaygala24/Qwen2.5-3B-RLOO-math-reasoning model is a 3.1-billion-parameter language model fine-tuned from Qwen2.5-3B and optimized for mathematical reasoning tasks. It was trained with the RLOO (REINFORCE Leave-One-Out) algorithm without a KL penalty on GSM8K and MATH training data, and evaluated on the GSM8K test set and MATH-500. It demonstrates strong performance on math reasoning benchmarks, achieving an overall pass@1 of 81.83% and pass@32 of 95.38% across 1819 problems.
Overview
This model, jaygala24/Qwen2.5-3B-RLOO-math-reasoning, is a specialized 3.1 billion parameter language model derived from Qwen2.5-3B. Its primary distinction lies in its fine-tuning process, which employs the RLOO (REINFORCE Leave-One-Out) algorithm without a KL penalty, specifically targeting enhanced mathematical reasoning capabilities.
Key Capabilities & Training
- Mathematical Reasoning: Optimized for solving complex math problems, trained on the `gsm8k_train` and `math_train` datasets.
- RLOO Algorithm: Uses a reinforcement learning approach in which the advantage baseline for each sampled completion is the leave-one-out mean reward of the other completions, combined with a REINFORCE-style policy loss.
- Performance: Achieves notable results on math reasoning benchmarks:
- GSM8K (test): 86.47% pass@1, 97.12% pass@32
- MATH-500: 69.59% pass@1, 90.80% pass@32
- Overall: 81.83% pass@1, 95.38% pass@32 across 1819 problems.
- Context Length: Supports a sequence length of 8192 tokens during training.
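The leave-one-out baseline described above can be sketched in a few lines: for each of k completions sampled per prompt, the baseline is the mean reward of the other k−1 samples. A minimal NumPy illustration (function name and example rewards are hypothetical; the actual training code is not part of this card):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for k sampled completions of one prompt.

    rewards: shape (k,), one scalar reward per completion.
    For sample i the baseline is the mean of the other k-1 rewards:
        advantage_i = r_i - (sum(r) - r_i) / (k - 1)
    """
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: 4 sampled solutions, reward 1 if correct and 0 otherwise.
adv = rloo_advantages(np.array([1.0, 0.0, 0.0, 1.0]))
# Each advantage then weights a REINFORCE-style policy-gradient term;
# per the card, no KL penalty is added to the objective.
```

Note that the leave-one-out advantages for a prompt always sum to zero, so correct samples are reinforced exactly as much as incorrect ones are penalized.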
Why this model is different
Unlike general-purpose LLMs, this model's specific RLOO fine-tuning makes it particularly adept at step-by-step mathematical problem-solving. Its training methodology and evaluation metrics highlight a focused effort on improving accuracy in arithmetic and algebraic reasoning, making it a strong candidate for applications requiring reliable mathematical output.
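The pass@1 and pass@32 figures quoted above are typically computed with the standard unbiased pass@k estimator: given n sampled solutions per problem of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch (the exact evaluation harness used for this card is an assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number of correct ones.
    Probability that at least one of k solutions drawn without
    replacement from the n samples is correct:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k must
        # contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 32 samples and 16 correct, pass@1 reduces to plain accuracy:
print(pass_at_k(32, 16, 1))  # → 0.5
```

Averaging this quantity over all 1819 benchmark problems yields the aggregate pass@1 and pass@32 numbers reported in the card.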