naru0411/LLM-competition-DPO

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

naru0411/LLM-competition-DPO is a 4-billion-parameter model based on Qwen3-4B-Instruct-2507, fine-tuned with Direct Preference Optimization (DPO) and supporting a 40,960-token context length. It is optimized to suppress verbose reasoning and enforce strict structured-output compliance, such as direct JSON or TOML generation without preambles. The model excels at producing clean, parseable outputs, preventing common parse errors in structured data generation tasks.


naru0411/LLM-competition-DPO: Structured Output Optimization

This model is a 4 billion parameter variant of Qwen/Qwen3-4B-Instruct-2507, fine-tuned using Direct Preference Optimization (DPO). Its primary distinction lies in its training objective, which diverges from typical Chain-of-Thought (CoT) tuning.

Key Capabilities

  • Suppresses Verbose Reasoning: Unlike models that provide step-by-step thought processes, this model is designed to output directly without preambles like "Approach:" or "Here is the code."
  • Strict Structured Output Compliance: Optimized to generate clean, parseable structured data formats such as JSON or TOML, minimizing parse errors.
  • Efficient Data Generation: Ideal for applications requiring direct, unadorned data outputs from the LLM.
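Whether a reply actually meets this contract can be checked cheaply on the consuming side. The helper below is a hypothetical sketch (not part of the model's tooling): it accepts a reply only if it parses as bare JSON, so any preamble or Markdown fence causes a rejection.

```python
import json

def is_clean_json(reply: str) -> bool:
    """Return True only if the reply is bare, parseable JSON.

    A reply wrapped in prose ("Here is the code: {...}") or in
    Markdown fences fails, because json.loads rejects the extra text.
    """
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False

# A direct reply passes; a reply with conversational filler does not.
assert is_clean_json('{"name": "widget", "count": 3}')
assert not is_clean_json('Approach: first build the JSON...\n{"name": "widget"}')
```

A check like this is useful as a regression test when comparing the DPO-tuned model against its base model on structured-output prompts.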

Training Details

The model was trained for 1 epoch with a learning rate of 1e-6 and a DPO beta of 0.05, which penalizes deviations from the chosen responses. Training used a maximum sequence length of 2048 tokens and a LoRA configuration (r=16, alpha=32) that was subsequently merged into the base model. The training data was u-10bei/dpo-dataset-qwen-cot.
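For intuition, the DPO objective with the beta above is a logistic loss over the policy-versus-reference log-probability margin between the chosen and rejected responses. A minimal numeric sketch (standalone arithmetic, not the actual training code) with beta = 0.05:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.05) -> float:
    """DPO loss for one preference pair, given summed log-probs.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta scales the margin, controlling how tightly the implicit
    reward is tied to deviation from the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree (margin 0), the loss is log 2.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2.0)) < 1e-9

# Widening the margin in favor of the chosen response lowers the loss.
assert dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2.0)
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen (direct, structured) responses than to the rejected (verbose) ones.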

Good For

  • Automated Data Extraction: Generating JSON or TOML outputs directly for programmatic consumption.
  • API Integration: LLM-powered applications that require clean, structured responses without conversational filler.
  • Reducing Post-Processing: Minimizing the need to parse or clean LLM outputs before use.