andrewlngdn/dsl-debug-7b-sft-rl

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · Published: Mar 12, 2026 · License: MIT · Architecture: Transformer · Open Weights

andrewlngdn/dsl-debug-7b-sft-rl is a fine-tune of Qwen2.5-7B-Instruct by Andrew Lngdn, specialized for debugging programs written in a custom dataflow DSL. It was trained in two stages: Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning. The model identifies and corrects errors across standard, nonlocal, and intent-mismatch debugging scenarios, and its primary strength is interactive code debugging with a fixed set of tools under a turn limit.


DSL Debug 7B: SFT then RL

This model, developed by Andrew Lngdn, is a Qwen2.5-7B-Instruct variant that has undergone a specialized two-stage fine-tuning process to excel at debugging programs written in a custom dataflow DSL. The training involved initial Supervised Fine-Tuning (SFT) followed by GRPO reinforcement learning, with this release representing the best-performing checkpoint from the RL stage.
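
To make the GRPO stage concrete, the sketch below shows the group-relative advantage computation at the core of GRPO in general terms. It is an illustration of the technique, not this model's actual training code; the 0/1 episode reward and group size of four are assumptions made for the example.

```python
# Illustrative GRPO group-relative advantage computation (general technique,
# not this model's training code). Rewards and group size are made up.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: (num_prompts, group_size) episode-level rewards for sampled rollouts."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One buggy program, four sampled debugging episodes:
# reward 1.0 if the submitted fix is correct, 0.0 otherwise (assumed reward scheme).
rewards = np.array([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # successful episodes get positive advantage
```

In the multi-turn setting described here, each group member would presumably be a full debugging episode scored by whether its submitted fix is correct.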

Key Capabilities

  • Specialized Debugging: Achieves significantly higher success rates in debugging tasks compared to prompt engineering or SFT-only methods, particularly in complex 'Nonlocal' and 'Intent-Mismatch' error categories.
  • Interactive Tool Use: Designed to debug code interactively using four tools: run(code), inspect(node_name), read_docs(operation), and submit(code), all within an 8-turn episode limit (see the usage sketch after this list).
  • Reinforcement Learning: Leverages GRPO for performance gains, showing substantial improvements over SFT-only and RL-only predecessors.
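
The sketch below shows one way the 8-turn tool loop might be driven locally with transformers. The system prompt, the one-tool-call-per-turn convention, and the execute_tool hook are assumptions for illustration, not the model's documented protocol.

```python
# Minimal sketch of an 8-turn interactive debugging episode with transformers.
# Tool-call syntax and execute_tool are assumptions, not a documented protocol.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "andrewlngdn/dsl-debug-7b-sft-rl"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def execute_tool(tool_call: str) -> str:
    """Hypothetical hook: route run/inspect/read_docs calls to your DSL runtime."""
    return "<tool output>"

messages = [
    {"role": "system", "content": (
        "Debug the DSL program. Tools: run(code), inspect(node_name), "
        "read_docs(operation), submit(code). You have 8 turns."
    )},
    {"role": "user", "content": "Program:\n<buggy DSL program here>"},
]

for turn in range(8):  # 8-turn episode limit
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})

    if "submit(" in reply:  # final fix submitted: end the episode
        break
    # Otherwise feed the tool's result back as the next user turn.
    messages.append({"role": "user", "content": execute_tool(reply)})
```

The same loop structure applies if the model is served behind an OpenAI-compatible endpoint; only the generation call changes.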

Performance Highlights

On a held-out test set, this model achieved:

  • 86.1% accuracy on Standard debugging problems.
  • 70.5% accuracy on Nonlocal debugging problems.
  • 28.2% accuracy on Intent-Mismatch debugging problems.

Despite this specialization, the model shows only a minor alignment tax on general benchmarks: MMLU 74.5% (vs 74.6% base), GSM8K 84.1% (vs 84.9% base), and HumanEval 62.2% (vs 65.9% base).

Good For

  • Automated DSL Debugging: Ideal for applications requiring automated identification and correction of errors in custom dataflow DSLs.
  • Research in RL for Code: Provides a strong baseline and example for multi-turn reinforcement learning applied to code debugging tasks.

For more details, refer to the blog post on Multi-Turn RL for Code Debugging and the GitHub repository.