andrewlngdn/dsl-debug-7b-sft-rl

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · Published: Mar 12, 2026 · License: MIT · Architecture: Transformer · Open Weights

andrewlngdn/dsl-debug-7b-sft-rl is a fine-tune of Qwen2.5-7B-Instruct by Andrew Lngdn, specialized for debugging programs written in a custom dataflow DSL. It was trained in two stages: Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning. The model identifies and corrects errors across standard, nonlocal, and intent-mismatch debugging scenarios, and its primary strength is interactive code debugging with a fixed set of tools under a turn limit.


DSL Debug 7B: SFT then RL

This model, developed by Andrew Lngdn, is a Qwen2.5-7B-Instruct variant that has undergone a specialized two-stage fine-tuning process to excel at debugging programs written in a custom dataflow DSL. The training involved initial Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning, with this release representing the best-performing checkpoint from the RL stage.
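
To make the GRPO stage concrete, the sketch below shows the group-relative advantage computation at the core of GRPO in general terms. It is an illustration of the technique, not this model's actual training code; the 0/1 episode reward and group size of four are assumptions made for the example.

```python
# Illustrative GRPO group-relative advantage computation (general technique,
# not this model's training code). Rewards and group size are made up.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: (num_prompts, group_size) episode-level rewards for sampled rollouts."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One buggy program, four sampled debugging episodes:
# reward 1.0 if the submitted fix is correct, 0.0 otherwise (assumed reward scheme).
rewards = np.array([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # successful episodes get positive advantage
```

In the multi-turn setting described here, each group member would presumably be a full debugging episode scored by whether its submitted fix is correct.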

Key Capabilities

  • Specialized Debugging: Achieves significantly higher success rates in debugging tasks compared to prompt engineering or SFT-only methods, particularly in complex 'Nonlocal' and 'Intent-Mismatch' error categories.
  • Interactive Tool Use: Designed to debug code interactively using four tools: run(code), inspect(node_name), read_docs(operation), and submit(code), all within an 8-turn episode limit (see the usage sketch after this list).
  • Reinforcement Learning: Leverages GRPO for performance gains, showing substantial improvements over SFT-only and RL-only predecessors.
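
The sketch below shows one way the 8-turn tool loop might be driven locally with transformers. The system prompt, the one-tool-call-per-turn convention, and the execute_tool hook are assumptions for illustration, not the model's documented protocol.

```python
# Minimal sketch of an 8-turn interactive debugging episode with transformers.
# Tool-call syntax and execute_tool are assumptions, not a documented protocol.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "andrewlngdn/dsl-debug-7b-sft-rl"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def execute_tool(tool_call: str) -> str:
    """Hypothetical hook: route run/inspect/read_docs calls to your DSL runtime."""
    return "<tool output>"

messages = [
    {"role": "system", "content": (
        "Debug the DSL program. Tools: run(code), inspect(node_name), "
        "read_docs(operation), submit(code). You have 8 turns."
    )},
    {"role": "user", "content": "Program:\n<buggy DSL program here>"},
]

for turn in range(8):  # 8-turn episode limit
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})

    if "submit(" in reply:  # final fix submitted: end the episode
        break
    # Otherwise feed the tool's result back as the next user turn.
    messages.append({"role": "user", "content": execute_tool(reply)})
```

The same loop structure applies if the model is served behind an OpenAI-compatible endpoint; only the generation call changes.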

Performance Highlights

On a held-out test set, this model achieved:

  • 86.1% accuracy on Standard debugging problems.
  • 70.5% accuracy on Nonlocal debugging problems.
  • 28.2% accuracy on Intent-Mismatch debugging problems.

Despite this specialization, the model shows only a minor alignment tax on general benchmarks: MMLU 74.5% (vs 74.6% base), GSM8K 84.1% (vs 84.9% base), and HumanEval 62.2% (vs 65.9% base).

Good For

  • Automated DSL Debugging: Ideal for applications requiring automated identification and correction of errors in custom dataflow DSLs.
  • Research in RL for Code: Provides a strong baseline and example for multi-turn reinforcement learning applied to code debugging tasks.

For more details, refer to the blog post on Multi-Turn RL for Code Debugging and the GitHub repository.