Yurori/qwen3-4b-dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Yurori/qwen3-4b-dpo-qwen-cot-merged is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model incorporates full-merged 16-bit weights, eliminating the need for adapter loading. It is optimized for improved performance through DPO, making it suitable for tasks requiring refined instruction following and preference alignment.

Loading preview...

Model Overview

Yurori/qwen3-4b-dpo-qwen-cot-merged is a 4 billion parameter language model derived from the Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), a method designed to align model outputs with human preferences more effectively. The fine-tuning process utilized the Unsloth library, known for efficient training.

Key Characteristics

  • Base Model: Qwen/Qwen3-4B-Instruct-2507, a robust foundation for instruction-following tasks.
  • Optimization Method: Direct Preference Optimization (DPO), enhancing the model's ability to generate preferred responses.
  • Weights: Contains full-merged 16-bit weights, meaning no separate adapter loading is required for deployment, simplifying integration.
  • Training Configuration: Fine-tuned for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024 tokens.

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Preference Alignment: Generating responses that closely match desired human preferences or specific output styles is critical.
  • Instruction Following: Improved adherence to complex instructions due to DPO fine-tuning.
  • Efficient Deployment: The merged 16-bit weights offer straightforward integration without additional adapter management.