jaygala24/Qwen3-1.7B-RLOO-math-reasoning
jaygala24/Qwen3-1.7B-RLOO-math-reasoning is a 1.7 billion parameter Qwen3-based causal language model fine-tuned by jaygala24. It is specifically optimized for mathematical reasoning tasks, utilizing the RLOO (REINFORCE Leave-One-Out) algorithm without KL penalty. The model demonstrates strong performance on benchmarks like GSM8K and MATH-500, achieving an overall pass@32 of 95.88%. With a 32768 token context length, it is designed for complex problem-solving in mathematics.
Loading preview...
Model Overview
This model, jaygala24/Qwen3-1.7B-RLOO-math-reasoning, is a specialized fine-tuned version of the Qwen3-1.7B base model. It has been specifically optimized for mathematical reasoning tasks using the RLOO (REINFORCE Leave-One-Out) algorithm without KL penalty, a reinforcement learning approach that leverages a leave-one-out mean reward as a baseline for policy loss. The training was conducted using the PipelineRL framework.
Key Capabilities & Performance
- Mathematical Reasoning: Excels at solving mathematical problems, as evidenced by its strong performance on standard benchmarks.
- RLOO Optimization: Utilizes a unique reinforcement learning strategy to enhance reasoning capabilities.
- Benchmark Results: Achieves competitive pass@k scores on challenging datasets:
- GSM8K (test): 96.66% pass@32
- MATH-500: 93.80% pass@32
- Overall: 95.88% pass@32 across 1819 problems.
Training Details
- Datasets: Trained on
gsm8k_trainandmath_traindatasets. - Algorithm: RLOO with a REINFORCE-style policy loss,
0.0KL Coefficient, and0.02Epsilon (clip). - Hyperparameters: Trained with a learning rate of
1e-06,bf16precision, and DeepSpeed ZeRO Stage 3 for efficiency.
Ideal Use Cases
- Automated Math Problem Solving: Generating step-by-step solutions for arithmetic and algebraic problems.
- Educational Tools: Assisting in the development of AI tutors or problem-solving aids for mathematics.
- Research in RL for Reasoning: A strong baseline or component for further research into reinforcement learning applications for complex reasoning tasks.