dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200
The dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200 model is a 1.5-billion-parameter Qwen2.5-Coder variant, fine-tuned using Group-Relative Policy Optimization (GRPO) with verifiable rewards from sandboxed test execution. It is optimized for code generation, particularly in Python, and features a 32,768-token context length. This release serves as a pipeline-validation run for GRPO, demonstrating the technique at small scale for research and educational purposes.
Model Overview
This model, dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200, is a 1.5-billion-parameter variant of the Qwen/Qwen2.5-Coder-1.5B-Instruct base model. It has been fine-tuned using Group-Relative Policy Optimization (GRPO), a technique that leverages verifiable rewards from sandboxed test execution, similar to methods used in DeepSeek-R1 and Kimi-K1.5. This particular release is a pipeline-validation run, trained from scratch (without SFT warm-start) for 200 steps on a single A100 GPU.
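For quick reference, here is a minimal loading-and-generation sketch using Hugging Face transformers. The prompt and generation settings are illustrative assumptions, and it assumes the checkpoint retains the base model's chat template.

```python
# Minimal usage sketch with Hugging Face transformers.
# The prompt and generation settings are illustrative assumptions;
# assumes the checkpoint keeps the base model's chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmaheshwar22/qwen-1.5b-coder-grpo-scratch-step200"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```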
Key Capabilities & Training Details
- Architecture: Qwen2.5-Coder-1.5B, optimized for code generation.
- Training Method: GRPO, using the verl framework, with a focus on verifiable rewards.
- Reward Function: A composite reward system based on sandboxed Docker execution (see the sketch after this list), incorporating:
  - Test-pass rate (primary signal)
  - Linting bonuses (ruff)
  - Length penalties
  - Compile-error penalties
- Context Length: 32,768 tokens.
- Performance (HumanEval+ pass@1): 0.6415, a modest improvement over the base (0.627) and SFT (0.638) baselines, with room for further gains from longer training.
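To make the reward design concrete, the sketch below shows how such a composite reward and GRPO's group-relative normalization could fit together. All weights, thresholds, and function names here are assumptions for illustration; the actual implementation lives in the companion repository.

```python
import numpy as np

# Illustrative composite reward; all coefficients are assumed
# placeholders, not the values used in training.
def composite_reward(pass_rate: float, ruff_errors: int,
                     num_tokens: int, compiled: bool) -> float:
    if not compiled:
        return -1.0                    # compile-error penalty (assumed value)
    reward = pass_rate                 # test-pass rate: primary signal in [0, 1]
    if ruff_errors == 0:
        reward += 0.1                  # lint bonus for a clean ruff run (assumed)
    if num_tokens > 512:
        reward -= 0.05 * (num_tokens - 512) / 512   # length penalty (assumed)
    return reward

# GRPO's core step: normalize each completion's reward against the
# mean and std of its sampling group for the same prompt.
def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```

During training, a group of completions is sampled per prompt, each is scored in the sandbox, and the group-normalized advantages weight the policy-gradient update.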
Intended Use Cases
- Research and Education: This model serves as a concrete reference for understanding an end-to-end GRPO implementation with verifiable rewards on a small, open-source coder model. The reward function, sandbox, and training configuration are open-sourced in the companion repository.
- Not for Production: Due to its early-stage training (200 steps from the base model), its performance is only comparable to the SFT baseline. A more advanced, SFT-warm-started version is planned.
Limitations
- Coding-only: Specialized for Python coding tasks; not designed for general-purpose chat or reasoning.
- Output Format: May occasionally wrap code in markdown fences, requiring post-processing (see the snippet after this list).
- Safety: Not safety-tuned; inherits behaviors from the base instruct model.
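As a concrete example of the post-processing noted under Output Format, a simple fence-stripping heuristic might look like the following. The regex is an assumption, not part of the model or its tooling, and may need adjustment for other fence styles.

```python
import re

# Strip a markdown code fence if the model wrapped its answer in one.
# Assumed heuristic; adjust the pattern for other fence styles as needed.
def strip_markdown_fence(text: str) -> str:
    match = re.search(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

print(strip_markdown_fence("```python\nprint('hi')\n```"))  # -> print('hi')
```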