hbx/JustRL-Nemotron-1.5B
hbx/JustRL-Nemotron-1.5B is a 1.5 billion parameter language model developed by hbx, fine-tuned from OpenMath-Nemotron-1.5B using a simplified Reinforcement Learning (RL) approach. This model excels in mathematical reasoning tasks, achieving competitive performance with a single-stage training pipeline and fixed hyperparameters. It demonstrates robust and stable improvement, offering an efficient solution for mathematical problem-solving at its scale.
Overview
hbx/JustRL-Nemotron-1.5B is a 1.5 billion parameter language model in the JustRL family, developed by hbx. It is fine-tuned from OpenMath-Nemotron-1.5B using a novel, simplified Reinforcement Learning (RL) recipe. Its main contribution is showing that competitive performance on mathematical reasoning tasks can be reached without complex multi-stage pipelines or dynamic schedules, using only single-stage training with fixed hyperparameters.
Key Capabilities & Differentiators
- Simplicity: Utilizes a single-stage training process with fixed hyperparameters, avoiding complex multi-stage pipelines.
- Stability: Exhibits smooth, monotonic performance improvement over thousands of training steps without collapses or oscillations.
- Performance: Achieves state-of-the-art results at the 1.5B scale for mathematical reasoning, matching or exceeding more complex approaches.
- Efficiency: Delivers comparable or superior performance with significantly less compute (about half the compute of some multi-stage methods).
- Robustness: Identical hyperparameters were applied to two different base models (DeepSeek and Nemotron) without per-model tuning.
- Mathematical Reasoning: Specifically optimized for mathematical problem-solving, showing strong results on benchmarks like AIME24, AMC23, and MATH-500.
Training Details
The model is trained with the standard GRPO algorithm, using binary outcome rewards computed by a simple DAPO verifier. Training uses the DAPO-Math-17k dataset without filtering or dynamic sampling, and keeps a 16K context cap. The recipe emphasizes minimalism and stability, as detailed in the authors' paper.
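To make the reward and advantage step above more concrete, here is a minimal Python sketch of how binary outcome rewards can feed GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' training code: the helper names (`extract_boxed`, `binary_reward`, `group_advantages`) are hypothetical, the exact-match check on a `\boxed{...}` answer is only a stand-in for the DAPO verifier, and the mean/standard-deviation normalization follows the commonly described GRPO formulation rather than anything confirmed on this page.

```python
import re
from statistics import mean, pstdev


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def binary_reward(response, reference):
    """Stand-in verifier (assumption): 1.0 on exact match of the boxed answer, else 0.0."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt (made-up strings for illustration).
responses = [
    "... so the answer is \\boxed{42}",
    "... therefore \\boxed{41}",
    "I am not sure.",
    "... giving \\boxed{42}",
]
rewards = [binary_reward(r, "42") for r in responses]
advantages = group_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.0f} advantage={advantage:+.3f}")
```

Correct responses receive a positive advantage and incorrect ones a negative advantage relative to their own group. When every sample in a group gets the same reward, all advantages are zero, so that prompt contributes no learning signal in this formulation.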
Good For
- Developers seeking an efficient and performant 1.5B parameter model for mathematical reasoning tasks.
- Applications requiring robust and stable RL-tuned models without the overhead of complex training methodologies.
- Research into simplified and scalable RL approaches for language models.