SunTaiyo/qwen3-4b-structured-output-dpo
TEXT GENERATION · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

SunTaiyo/qwen3-4b-structured-output-dpo is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought reasoning and improve the quality of structured responses, producing outputs aligned with the preference dataset and making it suitable for tasks that require precise, structured text generation.


Overview

This model, SunTaiyo/qwen3-4b-structured-output-dpo, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, leading to more logical and coherent outputs.
  • Structured Response Quality: Specifically trained to generate high-quality structured responses, making it ideal for tasks requiring formatted or constrained output.
  • DPO Fine-tuning: Utilizes DPO with a beta of 0.1 and a learning rate of 1e-07, focusing on aligning model behavior with desired preferences.
  • Merged Weights: Shipped as fully merged 16-bit weights, so no adapter loading is needed and deployment is simplified; see the loading sketch below.
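
Because the LoRA adapter is already merged, the model loads like any other causal LM. A minimal sketch using the standard transformers API (the dtype and device_map choices are illustrative, not prescribed by this card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SunTaiyo/qwen3-4b-structured-output-dpo"

# No PEFT/adapter step is required: the weights are fully merged.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)
```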

Good For

  • Applications requiring models to follow complex reasoning steps.
  • Generating structured data, such as JSON, XML, or other formatted text (see the example after this list).
  • Tasks where output alignment with specific preferences is critical.
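
A minimal usage sketch for JSON-style extraction, reusing the model and tokenizer loaded above; the system prompt and the schema are illustrative examples, not part of the model card:

```python
# Ask the model for structured output; the prompt and schema are made up
# for illustration.
messages = [
    {"role": "system", "content": "Reply with valid JSON only."},
    {"role": "user", "content": 'Extract {"name": ..., "year": ...} from: '
                                "Qwen3 was released in 2025."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```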

Training Details

The model was trained for one epoch with a maximum sequence length of 1,024 tokens on the u-10bei/dpo-dataset-qwen-cot dataset. Training used a LoRA configuration (r=8, alpha=16) whose adapter was subsequently merged into the base model.
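
For reference, a hedged reconstruction of this setup with Unsloth and TRL's DPOTrainer. The hyperparameters come from this card; the target-module list, output directory, and other defaults are assumptions, and exact argument names vary across TRL versions:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Load the base model named on the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,  # card: max sequence length 1024
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,             # card: LoRA r=8
    lora_alpha=16,   # card: LoRA alpha=16
    # Assumed target modules (standard for Qwen-family models; not stated on the card):
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,              # card: DPO beta
        learning_rate=1e-7,    # card: learning rate
        num_train_epochs=1,    # card: 1 epoch
        output_dir="outputs",  # assumed
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapter into 16-bit weights, matching the published model.
model.save_pretrained_merged("qwen3-4b-dpo-merged", tokenizer,
                             save_method="merged_16bit")
```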