hbx/JustRL-Nemotron-1.5B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Oct 31, 2025License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

hbx/JustRL-Nemotron-1.5B is a 1.5 billion parameter language model developed by hbx, fine-tuned from OpenMath-Nemotron-1.5B using a simplified Reinforcement Learning (RL) approach. This model excels in mathematical reasoning tasks, achieving competitive performance with a single-stage training pipeline and fixed hyperparameters. It demonstrates robust and stable improvement, offering an efficient solution for mathematical problem-solving at its scale.

Loading preview...

Overview

hbx/JustRL-Nemotron-1.5B is a 1.5 billion parameter language model, part of the JustRL family, developed by hbx. It is fine-tuned from OpenMath-Nemotron-1.5B using a novel, simplified Reinforcement Learning (RL) recipe. The core innovation lies in its ability to achieve competitive performance in mathematical reasoning tasks without complex multi-stage pipelines or dynamic schedules, relying instead on single-stage training with fixed hyperparameters.

Key Capabilities & Differentiators

  • Simplicity: Utilizes a single-stage training process with fixed hyperparameters, avoiding complex multi-stage pipelines.
  • Stability: Exhibits smooth, monotonic performance improvement over thousands of training steps without collapses or oscillations.
  • Performance: Achieves state-of-the-art results at the 1.5B scale for mathematical reasoning, matching or exceeding more complex approaches.
  • Efficiency: Delivers comparable or superior performance with significantly less compute (2x less than some multi-stage methods).
  • Robustness: The approach is robust, demonstrated by applying identical hyperparameters to different base models (DeepSeek and Nemotron) without per-model tuning.
  • Mathematical Reasoning: Specifically optimized for mathematical problem-solving, showing strong results on benchmarks like AIME24, AMC23, and MATH-500.

Training Details

The model is trained using a standard GRPO algorithm with binary outcome rewards and a simple DAPO verifier for reward calculation. It uses the DAPO-Math-17k dataset without filtering or dynamic sampling, and maintains a 16K context cap. The training recipe emphasizes minimalism and stability, as detailed in their paper.

Good For

  • Developers seeking an efficient and performant 1.5B parameter model for mathematical reasoning tasks.
  • Applications requiring robust and stable RL-tuned models without the overhead of complex training methodologies.
  • Research into simplified and scalable RL approaches for language models.