What is Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-toml-xml-yaml-dpo?
This model is a 4-billion-parameter Qwen3-based language model fine-tuned with Direct Preference Optimization (DPO). It starts from a Qwen3-4B-Instruct base, followed by a supervised fine-tuning (SFT) phase focused on structured data generation and conversion with chain-of-thought, and then two rounds of DPO. The second DPO round, which defines this model, applies format-specific preference optimization for structured data.
Key Capabilities & Optimizations
- Enhanced Structured Data Generation: Optimized to produce clean and correctly formatted outputs in TOML, YAML, XML, JSON, and CSV.
- Error Avoidance: Explicitly trained to avoid common issues such as unclosed XML tags, unescaped characters, incorrect YAML indentation, JSON wrapped in markdown code blocks, and explanation text mixed into the output.
- Improved Output Cleanliness: Significantly reduces markdown code-block wrapping, interleaved explanation text, and language mixing relative to the baseline.
- Chain-of-Thought (CoT) Reasoning: Generates structured CoT before each output, improving the structural accuracy of generated data.
- Factual Error Avoidance: Generates fictional/synthetic data to prevent factual inaccuracies often seen when attempting real-world data generation.
Performance Highlights
While the overall parse success rate on a public benchmark is comparable to the baseline (91.3% vs 92.0%), this model achieves higher qualitative evaluation scores due to improved output cleanliness and structural integrity. Notably, XML parsing success improved from 80% to 90% after DPO.
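A parse success rate like the 91.3% figure above is simply the fraction of benchmark outputs that parse without error. A minimal sketch, assuming a list of raw output strings and a pluggable per-format validator (here a JSON-only placeholder):

```python
import json

def is_valid_json(text: str) -> bool:
    """Placeholder validator: does the raw output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def parse_success_rate(outputs, validate=is_valid_json):
    """Fraction of model outputs that parse cleanly."""
    if not outputs:
        return 0.0
    return sum(validate(o) for o in outputs) / len(outputs)
```

In a real evaluation the validator would dispatch on the target format (TOML, YAML, XML, JSON, CSV) rather than assuming JSON.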
When to Use This Model
This model is ideal for applications requiring reliable, precisely formatted structured data outputs from an LLM. It is particularly suited to tasks where the output format (e.g., TOML, YAML, XML, JSON, CSV) must adhere strictly to syntax rules and be free of extraneous conversational text or formatting errors. Developers who need an LLM that consistently generates clean, parseable data will find it a good fit.