## Model Overview
This model, `fieldvalley-llm2025/llm2025_main_merged_dpo03`, is a 4-billion-parameter language model derived from `Qwen/Qwen2.5-7B-Instruct`. It was developed as a submission for the 2025 final project main competition, focusing on highly constrained output generation.
## Key Capabilities & Training
The model underwent a comprehensive fine-tuning process using Unsloth (LoRA) and a three-stage Direct Preference Optimization (DPO) strategy:
- **Supervised Fine-Tuning (SFT):** Initial training on high-quality datasets (`daichira/structured-hard-sft-4k`, `u-10bei/structured_data_with_cot_dataset_512_v4`) to enhance instruction following and adherence to a variety of structured formats (JSON, XML, TOML, YAML, CSV).
- **DPO Rounds 1 & 2:** Focused on general preference learning and hallucination suppression.
- **DPO Round 3 (This Model's Specialization):** The most critical stage, in which the model was aggressively trained for only 100 steps to eliminate any non-JSON output. This involved contrasting pure JSON (chosen) against 8 types of rejected outputs, including Markdown fences (`` ```json ``), generic fences (`` ``` ``), conversational preambles ("Here is the JSON..."), postscripts, and Markdown headings. Strict filtering ensured only pure JSON objects (`{...}`) were used for this final DPO phase.
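As an illustration of the kind of strict filter described above (a hypothetical reconstruction, not the project's actual training code), a chosen sample can be required to parse as a bare JSON object with no surrounding text:

```python
import json

def is_pure_json_object(text: str) -> bool:
    """Accept only strings that are exactly one JSON object ({...})
    with no fences, preambles, or trailing commentary."""
    stripped = text.strip()
    # Must begin and end as a JSON object, not an array or scalar.
    if not (stripped.startswith("{") and stripped.endswith("}")):
        return False
    # Must also round-trip through a strict JSON parser.
    try:
        return isinstance(json.loads(stripped), dict)
    except json.JSONDecodeError:
        return False

# A pure JSON object passes; typical rejected patterns fail.
assert is_pure_json_object('{"name": "qwen", "size": 4}')
assert not is_pure_json_object('```json\n{"name": "qwen"}\n```')        # Markdown fence
assert not is_pure_json_object('Here is the JSON... {"name": "qwen"}')  # preamble
```

A check like this would reject all eight contaminated output types listed above, since each one adds characters outside the `{...}` span or breaks strict JSON parsing.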
## Primary Differentiator
This model's core strength is its strict adherence to JSON output format, making it highly reliable for tasks where clean, parseable structured data is paramount. It is specifically engineered to avoid the common LLM tendency to wrap JSON responses in explanatory text or Markdown formatting.
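In practice, this means downstream code can hand the raw completion straight to a JSON parser without stripping fences or preamble text first. A minimal consumer sketch (the response string below is a stand-in for an actual model completion):

```python
import json

def parse_model_output(response: str) -> dict:
    """Parse a completion that is expected to be a pure JSON object.

    Raises ValueError with context if the output is not parseable JSON,
    so any regression to fenced or conversational output surfaces loudly.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Expected pure JSON, got: {response[:80]!r}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data

# Example with a stand-in completion string:
record = parse_model_output('{"title": "report", "status": "ok"}')
print(record["status"])  # → ok
```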