duong942001/dpo-qwen-cot-merged-pa-ad
The duong942001/dpo-qwen-cot-merged-pa-ad model is a 4 billion parameter Qwen3-based causal language model fine-tuned using Direct Preference Optimization (DPO). It is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought (CoT), and structured response quality. This model provides full-merged 16-bit weights, making it suitable for direct deployment in applications requiring enhanced logical coherence and aligned outputs.
Loading preview...
Model Overview
This model, duong942001/dpo-qwen-cot-merged-pa-ad, is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improved alignment with preferred outputs.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured responses.
- Preference Alignment: Fine-tuned with DPO to align its outputs with desired response patterns, based on a preference dataset.
- Direct Use: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and allowing direct integration with
transformers.
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.4, using a maximum sequence length of 1536. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset. The LoRA configuration (r=8, alpha=16) was merged into the base model during the fine-tuning process.
Licensing
This model is released under the MIT License, consistent with the terms of its training data. Users must also adhere to the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507.