DSL Debug 7B — RL-Only Step 30
This model, andrewlngdn/dsl-debug-7b-rl-only-step30, is a 7.6 billion parameter variant of the Qwen2.5-7B-Instruct architecture, trained by andrewlngdn with Group Relative Policy Optimization (GRPO). Its distinguishing feature is that reinforcement learning was applied directly to the base model, skipping the usual Supervised Fine-Tuning (SFT) warmup phase entirely, with the goal of improving multi-turn code-debugging ability.
Key Capabilities
- Specialized Code Debugging: Achieves substantial improvements in multi-turn code debugging tasks.
- Enhanced Performance on Debugging Splits: Large gains over the base model on the standard (78.8% vs 50.5%), nonlocal (54.0% vs 12.0%), and intent-mismatch (14.7% vs 0.6%) test splits.
- GRPO Training: Trained with GRPO combined with multi-turn tool use, so the model learns to debug interactively over several turns rather than in a single pass.
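To make the GRPO training signal concrete: GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation, avoiding a separate value network. The sketch below is illustrative of that standard group-relative advantage computation, not the author's actual training code; the binary pass/fail reward is an assumption.

```python
# Sketch of GRPO's group-relative advantage step (illustrative only;
# not taken from this model's training code).
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-completion rewards within one sampled group of rollouts."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of one debugging prompt, rewarded 1.0 when the
# proposed fix passes the tests (an assumed reward scheme).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and are reinforced; the group mean acts as the baseline, which is what removes the need for a learned critic.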
Good For
- Automated Code Correction: Ideal for applications requiring models to identify and fix errors in code.
- Debugging Tools: Can be integrated into development environments for intelligent debugging assistance.
- Research in RL for Code: Provides a strong baseline for exploring reinforcement learning techniques in code generation and debugging, particularly without SFT pre-training.
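For integration into a debugging tool, a minimal inference sketch is below. It assumes the checkpoint loads like any Qwen2.5-Instruct model through Hugging Face transformers; the prompt wording and the helper names (`build_debug_messages`, `run_debug_turn`) are illustrative, not a documented interface.

```python
# Hedged sketch of one debugging turn with Hugging Face transformers.
# Assumes standard Qwen2.5-Instruct chat-template support.
MODEL_ID = "andrewlngdn/dsl-debug-7b-rl-only-step30"

def build_debug_messages(code: str, error: str) -> list[dict]:
    """Assemble a chat asking the model to diagnose a failing snippet."""
    return [
        {"role": "system", "content": "You are a code-debugging assistant."},
        {"role": "user",
         "content": f"This code fails:\n\n{code}\n\nError:\n{error}\n\n"
                    "Find and fix the bug."},
    ]

def run_debug_turn(code: str, error: str) -> str:
    """Generate one reply. Not called here: downloads the 7.6B checkpoint."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tok.apply_chat_template(
        build_debug_messages(code, error),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```

In a multi-turn debugging loop, the model's reply and any tool output would be appended to the message list before the next call.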
Despite its strong debugging performance, the model exhibits a slight "alignment tax" on general benchmarks such as GSM8K and HumanEval: its specialization comes with minor trade-offs in broader reasoning and coding ability. For more details, see the Multi-Turn RL for Code Debugging blog post.