JustRL-DeepSeek-1.5B Overview
JustRL-DeepSeek-1.5B is a 1.5-billion-parameter language model from hbx, derived from DeepSeek-R1-Distill-Qwen-1.5B. It demonstrates that competitive reinforcement learning (RL) performance for small language models does not require complex multi-stage pipelines or dynamic schedules: a minimal recipe of single-stage training with fixed hyperparameters yields robust, stable results.
Key Capabilities & Features
- Simplicity: Employs a single-stage training process with fixed hyperparameters, avoiding complex multi-stage pipelines.
- Stability: Exhibits smooth, monotonic improvement over thousands of training steps without collapses or oscillations.
- Performance: Achieves state-of-the-art results at the 1.5B scale on mathematical reasoning benchmarks, matching or exceeding more complex approaches.
- Efficiency: Delivers comparable or superior performance at significantly lower computational cost (e.g., roughly half the compute of some multi-stage methods).
- Openness: Model weights and complete evaluation scripts are publicly released.
- Training: Uses a standard GRPO algorithm with binary outcome rewards and a simple DAPO verifier, trained on the DAPO-Math-17k dataset without filtering or dynamic sampling.
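The core of the GRPO setup described above is a group-relative advantage: several completions are sampled per prompt, each gets a binary outcome reward, and each reward is normalized against the group's mean and standard deviation. A minimal sketch (function name and group handling are illustrative, not the project's actual code):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: for a group of G sampled
    completions to the same prompt, subtract the group's mean reward and
    divide by its standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    if std < 1e-8:  # all rewards identical -> no learning signal for this group
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Binary outcome rewards for 4 sampled completions: 1.0 if the final
# answer verified as correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)  # correct samples get positive advantage
```

With binary rewards, correct completions in a mixed group receive positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward verified solutions without any learned reward model.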
Ideal Use Cases
- Mathematical Reasoning: Excels in solving complex mathematical problems, as evidenced by its strong performance on benchmarks like AIME24, AIME25, AMC23, and MATH-500.
- Resource-Constrained Environments: At 1.5B parameters, the model is practical to deploy where memory and compute budgets are tight.
- Research in RL for LLMs: Provides a strong baseline and a demonstration of effective, simplified RL techniques for language models.
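For math benchmarks like those above, the binary outcome reward mentioned under Training typically reduces to checking the model's final boxed answer against the reference. A hedged sketch of such a verifier (this is illustrative; the project's actual verifier may normalize answers differently):

```python
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in a response,
    handling nested braces, or None if absent."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces

def binary_reward(response, gold):
    """1.0 on exact match of the whitespace-normalized boxed answer, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    norm = lambda s: re.sub(r"\s+", "", s)
    return 1.0 if norm(pred) == norm(gold) else 0.0
```

Exact string matching is deliberately strict; real math verifiers often add symbolic equivalence checks so that, e.g., `0.5` and `\frac{1}{2}` both count as correct.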