DSL Debug 7B — RL-Only Step 30
This model, andrewlngdn/dsl-debug-7b-rl-only-step30, is a 7.6 billion parameter variant of the Qwen2.5-7B-Instruct architecture, trained by andrewlngdn with Group Relative Policy Optimization (GRPO). Its distinguishing feature is that reinforcement learning was applied directly to the base model, skipping the usual Supervised Fine-Tuning (SFT) warmup phase entirely, with the goal of improving multi-turn code-debugging ability.
Key Capabilities
- Specialized Code Debugging: Achieves substantial improvements in multi-turn code debugging tasks.
- Enhanced Performance on Debugging Splits: Large gains over the base model on the standard (78.8% vs 50.5%), nonlocal (54.0% vs 12.0%), and intent-mismatch (14.7% vs 0.6%) test splits.
- GRPO Training: Trained with GRPO combined with multi-turn tool use, so the model learns to debug interactively over several turns rather than in a single pass.
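To make the GRPO training signal concrete: GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation, avoiding a separate value network. The sketch below is illustrative of that standard group-relative advantage computation, not the author's actual training code; the binary pass/fail reward is an assumption.

```python
# Sketch of GRPO's group-relative advantage step (illustrative only;
# not taken from this model's training code).
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-completion rewards within one sampled group of rollouts."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of one debugging prompt, rewarded 1.0 when the
# proposed fix passes the tests (an assumed reward scheme).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and are reinforced; the group mean acts as the baseline, which is what removes the need for a learned critic.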
Good For
- Automated Code Correction: Ideal for applications requiring models to identify and fix errors in code.
- Debugging Tools: Can be integrated into development environments for intelligent debugging assistance.
- Research in RL for Code: Provides a strong baseline for exploring reinforcement learning techniques in code generation and debugging, particularly without SFT pre-training.
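For integration into a debugging tool, a minimal inference sketch is below. It assumes the checkpoint loads like any Qwen2.5-Instruct model through Hugging Face transformers; the prompt wording and the helper names (`build_debug_messages`, `run_debug_turn`) are illustrative, not a documented interface.

```python
# Hedged sketch of one debugging turn with Hugging Face transformers.
# Assumes standard Qwen2.5-Instruct chat-template support.
MODEL_ID = "andrewlngdn/dsl-debug-7b-rl-only-step30"

def build_debug_messages(code: str, error: str) -> list[dict]:
    """Assemble a chat asking the model to diagnose a failing snippet."""
    return [
        {"role": "system", "content": "You are a code-debugging assistant."},
        {"role": "user",
         "content": f"This code fails:\n\n{code}\n\nError:\n{error}\n\n"
                    "Find and fix the bug."},
    ]

def run_debug_turn(code: str, error: str) -> str:
    """Generate one reply. Not called here: downloads the 7.6B checkpoint."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tok.apply_chat_template(
        build_debug_messages(code, error),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```

In a multi-turn debugging loop, the model's reply and any tool output would be appended to the message list before the next call.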
Despite its strong debugging performance, the model exhibits a slight "alignment tax" on general benchmarks such as GSM8K and HumanEval: its specialization comes with minor trade-offs in broader reasoning and coding ability. For more details, see the Multi-Turn RL for Code Debugging blog post.