Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2
Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. The model ships as fully merged 16-bit weights, so no adapter loading is required. It targets tasks that benefit from preference-based alignment.
Model Overview
Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2 is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO), a technique that aligns the model's outputs more closely with human preferences; training used the Unsloth library for efficiency.
Key Characteristics
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Parameter Count: 4 billion parameters
- Context Length: 40,960 tokens (inherited from base model)
- Weight Format: Fully merged 16-bit weights; no adapter loading required for deployment
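Because the weights are fully merged, the model can be loaded directly with Hugging Face transformers, with no PEFT adapter step. The sketch below is illustrative, not an official usage snippet from the model card; the prompt and generation settings are arbitrary assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Chiaki111/dpo-qwen-cot-merged_dpo_v1_l2"

def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the merged checkpoint directly (no adapters) and generate a reply."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate_reply("Summarize Direct Preference Optimization in one sentence."))
```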
Training Details
DPO fine-tuning ran for 1 epoch with a learning rate of 1e-6 and a DPO beta of 0.1. The maximum sequence length during training was 1,024 tokens.
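To make the role of beta concrete, the per-example DPO loss is the negative log-sigmoid of beta times the difference between the policy's and the reference model's chosen-vs-rejected log-probability margins. A small self-contained illustration with toy log-probabilities (not values from this model's training run):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Toy numbers: the policy prefers the chosen answer more strongly than the
# reference does, so the loss falls below log(2) ≈ 0.6931.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
print(round(loss, 4))  # ≈ 0.5544
```

A small beta such as 0.1 flattens the logits, penalizing large divergences from the reference model less sharply than a larger beta would.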
Intended Use
This model is suitable for applications where a DPO-tuned Qwen3-4B variant is desired, particularly for tasks that benefit from preference-based alignment. Its full-merged weights simplify deployment by removing the need for separate adapter management.