Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2

Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2 is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. The checkpoint ships as full-merged 16-bit weights, so no adapter loading is required. It targets applications where outputs aligned with human preferences are desired.


Model Overview

Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2 is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. The model has undergone Direct Preference Optimization (DPO), a fine-tuning technique that aligns a model's outputs more closely with human preferences. Training was performed with the Unsloth library for efficiency.
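
For reference, DPO optimizes the following objective (Rafailov et al., 2023), where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen base model, $(x, y_w, y_l)$ is a prompt paired with preferred and dispreferred responses, and $\beta$ (0.1 in this run, per the training details below) controls how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$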

Key Characteristics

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Parameter Count: 4 billion parameters
  • Context Length: 40960 tokens (inherited from base model)
  • Weight Format: Full-merged 16-bit weights; no adapter loading is required for deployment (see the loading sketch below)
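
Because the weights are fully merged, the checkpoint loads like any standard causal LM with transformers, with no PEFT or adapter step. A minimal sketch (the prompt text is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2"

# Merged 16-bit weights: no separate adapter loading is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 checkpoint
    device_map="auto",
)

# Illustrative prompt; Qwen3 instruct models use a chat template.
messages = [{"role": "user", "content": "Summarize Direct Preference Optimization in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```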

Training Details

The DPO fine-tuning was conducted over 1 epoch with a learning rate of 1e-06 and a beta value of 0.1. The maximum sequence length used during training was 1024 tokens.
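
The card does not include the training script or dataset, so the sketch below only reconstructs an equivalent configuration with TRL's DPOTrainer using the stated hyperparameters (1 epoch, learning rate 1e-06, beta 0.1, max length 1024). The actual run used Unsloth for efficiency, and the preference dataset shown here is a placeholder:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder preference pairs; the real dataset is not documented in the card.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain overfitting briefly."],
    "chosen": ["Overfitting is when a model fits noise in the training data ..."],
    "rejected": ["Overfitting is good."],
})

config = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,   # stated: 1 epoch
    learning_rate=1e-6,   # stated: 1e-06
    beta=0.1,             # stated: beta = 0.1
    max_length=1024,      # stated: max sequence length of 1024 tokens
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```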

Intended Use

This model is suitable for applications where a DPO-tuned Qwen3-4B variant is desired, particularly for tasks that benefit from preference-based alignment. Its full-merged weights simplify deployment by removing the need for separate adapter management.
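
As one illustration of that simplified deployment, the merged checkpoint can be pointed at directly by an inference engine. A minimal vLLM sketch (sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# The merged checkpoint loads like any standard Hugging Face model repo.
llm = LLM(model="Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)

conversation = [{"role": "user", "content": "Give one use case for a DPO-tuned assistant."}]
outputs = llm.chat(conversation, sampling_params=params)
print(outputs[0].outputs[0].text)
```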