Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-toml-xml-yaml-dpo

Hosted on Hugging Face.

- Task: Text generation
- Model size: 4B parameters
- Quantization: BF16
- Context length: 32k
- Published: Feb 8, 2026
- License: apache-2.0
- Architecture: Transformer (open weights)
- Concurrency cost: 1

Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-toml-xml-yaml-dpo is a 4 billion parameter Qwen3-based instruction-tuned language model, fine-tuned with Direct Preference Optimization (DPO) for enhanced structured data generation. It specializes in producing clean, well-formed outputs in formats like TOML, YAML, XML, JSON, and CSV, avoiding common errors like incorrect formatting or extraneous text. This model is optimized for developers requiring reliable and structured data outputs from an LLM.


What is Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-toml-xml-yaml-dpo?

This model is a 4 billion parameter Qwen3-based language model fine-tuned in several stages, culminating in Direct Preference Optimization (DPO). It builds on a Qwen3-4B-Instruct base, followed by a supervised fine-tuning (SFT) phase targeting structured data generation and conversion with Chain-of-Thought, and then two rounds of DPO. The second DPO round, which produced this model, applies format-specific preference optimization for structured data.
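DPO, mentioned above, optimizes the policy directly on preference pairs rather than training a separate reward model. As a hedged illustration (this is not the model's actual training code, and the function name and inputs are invented for the sketch), the per-pair DPO loss can be computed from summed completion log-probabilities like this:

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The arguments are summed log-likelihoods of the chosen / rejected
    completions under the trained policy and the frozen reference model.
    """
    logits = beta * ((policy_chosen_lp - policy_rejected_lp)
                     - (ref_chosen_lp - ref_rejected_lp))
    # -log sigmoid(x) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# When the policy prefers the chosen completion more strongly than the
# reference does, the loss falls below log(2) ~= 0.693.
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))
```

Here a "chosen" completion would be a cleanly formatted TOML/XML/YAML output and a "rejected" one a malformed or codeblock-wrapped variant, so minimizing this loss pushes the model toward the clean format.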

Key Capabilities & Optimizations

  • Enhanced Structured Data Generation: Optimized to produce clean and correctly formatted outputs in TOML, YAML, XML, JSON, and CSV.
  • Error Avoidance: Explicitly trained to reject common issues such as unclosed XML tags, unescaped characters, incorrect YAML indentation, JSON wrapped in codeblocks, and mixed explanation text.
  • Improved Output Cleanliness: Significantly reduces instances of markdown codeblock wrapping, mixed explanation text, and language mixing compared to its baseline.
  • Chain-of-Thought (CoT) Reasoning: Generates structured CoT before each output, improving the structural accuracy of generated data.
  • Factual Error Avoidance: Generates fictional/synthetic data to prevent factual inaccuracies often seen when attempting real-world data generation.

Performance Highlights

While the overall parse success rate on a public benchmark is comparable to the baseline (91.3% vs 92.0%), this model achieves higher qualitative evaluation scores due to improved output cleanliness and structural integrity. Notably, XML parsing success improved from 80% to 90% after DPO.
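A parse success rate like the ones quoted above is simply the fraction of model outputs that a strict parser accepts. The benchmark's actual harness is not described in this card, but as an illustrative sketch, an XML rate could be computed over a batch of outputs with the stdlib alone:

```python
import xml.etree.ElementTree as ET


def xml_parse_success_rate(outputs: list[str]) -> float:
    """Fraction of outputs that are well-formed XML."""
    ok = 0
    for text in outputs:
        try:
            ET.fromstring(text.strip())
            ok += 1
        except ET.ParseError:
            pass
    return ok / len(outputs) if outputs else 0.0


# Illustrative only: one malformed output (unclosed <port> tag) out of two.
samples = ["<cfg><port>8080</port></cfg>", "<cfg><port>8080</cfg>"]
print(xml_parse_success_rate(samples))  # 0.5
```

The same pattern extends to the other formats by swapping in the corresponding parser.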

When to Use This Model

This model is ideal for applications requiring reliable and precisely formatted structured data outputs from an LLM. It's particularly suited for tasks where the output format (e.g., TOML, YAML, XML, JSON, CSV) must adhere strictly to syntax rules and be free from extraneous conversational text or formatting errors. Developers needing an LLM that can consistently generate clean, parseable data will find this model highly beneficial.