andrewlngdn/dsl-debug-7b-sft-rl
# DSL Debug 7B: SFT then RL
This model, developed by Andrew Lngdn, is a Qwen2.5-7B-Instruct variant that has undergone a specialized two-stage fine-tuning process to excel at debugging programs written in a custom dataflow DSL. The training involved initial Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning, with this release representing the best-performing checkpoint from the RL stage.
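GRPO removes the need for a learned value model by computing group-relative advantages: for each problem, several episodes are sampled, and each episode's reward is normalized against the mean and standard deviation of its own group. A minimal sketch of that advantage computation (the function name and the 0/1 reward scheme are illustrative, not taken from this repo):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled episode's reward
    against the mean/std of its own group, so the group itself serves
    as the baseline instead of a learned value model."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of sampled debugging episodes for the same problem:
# 1.0 = bug fixed, 0.0 = episode failed (illustrative reward scheme).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Episodes that beat their group's average get positive advantages (their actions are reinforced); below-average episodes get negative ones.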
## Key Capabilities
- Specialized Debugging: Achieves significantly higher success rates in debugging tasks compared to prompt engineering or SFT-only methods, particularly in complex 'Nonlocal' and 'Intent-Mismatch' error categories.
- Interactive Tool Use: Designed to debug code interactively using a set of four tools: `run(code)`, `inspect(node_name)`, `read_docs(operation)`, and `submit(code)`, within an 8-turn episode limit.
- Reinforcement Learning: Leverages GRPO for performance gains, showing substantial improvements over SFT-only and RL-only predecessors.
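The episode structure described above can be sketched as a simple driver loop: each turn the model emits one tool call, observations accumulate in the history, and `submit(code)` ends the episode. The model and tool implementations below are stand-in stubs (the real harness, prompts, and DSL runtime are not part of this card):

```python
MAX_TURNS = 8  # episode limit described above

def run_episode(model, tools):
    """Drive one debugging episode: the model emits one tool call per
    turn; submit(code) ends the episode with the proposed fix."""
    history = []
    for turn in range(MAX_TURNS):
        tool_name, arg = model.next_call(history)   # e.g. ("run", code)
        if tool_name == "submit":
            return arg, history                     # proposed fixed program
        observation = tools[tool_name](arg)         # run / inspect / read_docs
        history.append((tool_name, arg, observation))
    return None, history                            # ran out of turns

# Stubs to illustrate the control flow only:
class StubModel:
    def __init__(self):
        self.calls = [("run", "prog"), ("inspect", "node_a"),
                      ("submit", "fixed_prog")]
    def next_call(self, history):
        return self.calls[min(len(history), len(self.calls) - 1)]

tools = {
    "run": lambda code: "error: type mismatch at node_a",
    "inspect": lambda node: "node_a: map(int -> str)",
    "read_docs": lambda op: f"docs for {op}",
}
fix, history = run_episode(StubModel(), tools)
```

In the real setup, the episode reward (did the submitted program pass?) is what feeds the GRPO stage.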
## Performance Highlights
On a held-out test set, this model achieved:
- 86.1% accuracy on Standard debugging problems.
- 70.5% accuracy on Nonlocal debugging problems.
- 28.2% accuracy on Intent-Mismatch debugging problems.
While highly specialized, the model exhibits only a minor alignment tax on general benchmarks:

| Benchmark | This model | Base (Qwen2.5-7B-Instruct) |
|---|---|---|
| MMLU | 74.5% | 74.6% |
| GSM8K | 84.1% | 84.9% |
| HumanEval | 62.2% | 65.9% |
## Good For
- Automated DSL Debugging: Ideal for applications requiring automated identification and correction of errors in custom dataflow DSLs.
- Research in RL for Code: Provides a strong baseline and example for multi-turn reinforcement learning applied to code debugging tasks.
For more details, refer to the blog post *Multi-Turn RL for Code Debugging* and the accompanying GitHub repository.