# DSL Debug 7B: SFT then RL
This model, developed by Andrew Lngdn, is a Qwen2.5-7B-Instruct variant that has undergone a specialized two-stage fine-tuning process to excel at debugging programs written in a custom dataflow DSL. The training involved initial Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning, with this release representing the best-performing checkpoint from the RL stage.
## Key Capabilities
- Specialized Debugging: Achieves significantly higher success rates in debugging tasks compared to prompt engineering or SFT-only methods, particularly in complex 'Nonlocal' and 'Intent-Mismatch' error categories.
- Interactive Tool Use: Designed to debug code interactively using a set of four tools: `run(code)`, `inspect(node_name)`, `read_docs(operation)`, and `submit(code)`, within an 8-turn episode limit.
- Reinforcement Learning: Leverages GRPO for performance gains, showing substantial improvements over SFT-only and RL-only predecessors.
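The four-tool, 8-turn episode described above can be sketched as a simple dispatch loop. This is an illustrative assumption, not the model's actual harness: the tool names come from the card, but the environment, model stub, and dispatcher shown here are hypothetical.

```python
# Hypothetical sketch of an 8-turn interactive debugging episode.
# Tool names (run, inspect, read_docs, submit) are from the model card;
# the env/model interfaces are illustrative assumptions.

MAX_TURNS = 8

def run_episode(model, env, max_turns=MAX_TURNS):
    """Drive one episode: the model issues one tool call per turn
    until it submits a candidate fix or the turn budget runs out."""
    transcript = []
    for _ in range(max_turns):
        tool, arg = model(transcript)      # model chooses the next tool call
        if tool == "run":
            obs = env.run(arg)             # execute candidate DSL code
        elif tool == "inspect":
            obs = env.inspect(arg)         # examine a dataflow node's state
        elif tool == "read_docs":
            obs = env.read_docs(arg)       # look up an operation's docs
        elif tool == "submit":
            return env.check(arg)          # submitting ends the episode
        else:
            obs = f"unknown tool: {tool}"
        transcript.append((tool, arg, obs))
    return False                           # budget exhausted: episode fails
```

The key design point is that `submit` is terminal while the other three tools only append observations to the transcript, so the model must budget its information-gathering turns.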
## Performance Highlights
On a held-out test set, this model achieved:
- 86.1% accuracy on Standard debugging problems.
- 70.5% accuracy on Nonlocal debugging problems.
- 28.2% accuracy on Intent-Mismatch debugging problems.
While highly specialized, the model exhibits a minor alignment tax on general benchmarks like MMLU (74.5% vs 74.6% base), GSM8K (84.1% vs 84.9% base), and HumanEval (62.2% vs 65.9% base).
## Good For
- Automated DSL Debugging: Ideal for applications requiring automated identification and correction of errors in custom dataflow DSLs.
- Research in RL for Code: Provides a strong baseline and example for multi-turn reinforcement learning applied to code debugging tasks.
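For researchers using this model as a multi-turn RL baseline, the core GRPO training signal is a group-relative advantage: sample a group of rollouts per problem and normalize each rollout's reward by the group's mean and standard deviation. A minimal sketch, assuming the standard GRPO formulation (the function below is illustrative, not this model's training code):

```python
# Hypothetical sketch of GRPO's group-relative advantage computation.
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Advantage of each rollout = (reward - group mean) / group std.

    Rollouts that beat the group average get positive advantage;
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because the normalization is within-group, no learned value function is needed: the other rollouts for the same problem serve as the baseline.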
For more details, refer to the blog post on Multi-Turn RL for Code Debugging and the GitHub repository.