Overview
This model, q-hisa/dpo-qwen-cot-merged-v5, is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to align its responses with preferred outputs. Training consisted of an initial Supervised Fine-Tuning (SFT) phase with a LoRA adapter, followed by DPO to further refine the model's behavior.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more structured and logical problem-solving.
- Improved Structured Responses: Produces more coherent, higher-quality outputs aligned with the preference dataset.
- Direct Preference Optimization (DPO): Aligns model behavior with human preferences, yielding more helpful responses.
- Full-Merged Weights: The repository provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
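Because the weights are fully merged, the model can be loaded and run with plain Hugging Face Transformers, with no PEFT or adapter step. Below is a minimal inference sketch; the chat-template usage and generation settings are assumptions, not an official example from this repository.

```python
# Minimal inference sketch for the merged model (assumed usage, not an
# official example). No PEFT/adapter loading is needed because the
# repository ships full-merged 16-bit weights.

MODEL_ID = "q-hisa/dpo-qwen-cot-merged-v5"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format expected by
    tokenizer.apply_chat_template."""
    return [{"role": "user", "content": question}]

def main() -> None:
    # Imports are local to keep the prompt-building helper importable
    # without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    messages = build_messages(
        "A train travels 120 km in 1.5 hours. What is its average speed?"
    )
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    print(tokenizer.decode(output[0][inputs.shape[-1]:],
                           skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

For Chain-of-Thought tasks, phrasing the prompt as a step-by-step question (as above) plays to the model's tuned strengths.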
Training Details
The model was trained with DPO for 1 epoch at a learning rate of 1e-07, a beta of 0.05, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License; users must also comply with the original base model's license terms.
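The card reports the hyperparameters but not the exact training script, so the following is only a hypothetical sketch of how the DPO stage could be reproduced with TRL's `DPOTrainer`. The dataset column names and everything not stated above (output directory, batch sizes, use of TRL rather than the original Unsloth pipeline) are assumptions.

```python
# Hypothetical reproduction sketch of the DPO stage using TRL.
# Only the hyperparameters marked "reported" come from the model card;
# all other settings are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen3-4B-Instruct-2507"  # base model named in the card
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference dataset named in the card; assumed to expose the standard
# "prompt" / "chosen" / "rejected" columns that DPOTrainer expects.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="dpo-qwen-cot",  # assumed
    num_train_epochs=1,         # reported: 1 epoch
    learning_rate=1e-7,         # reported: 1e-07
    beta=0.05,                  # reported: DPO beta
    max_length=1024,            # reported: max sequence length
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit reference model is passed, `DPOTrainer` creates a frozen copy of the policy model to serve as the DPO reference.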
Good For
- Applications requiring robust reasoning capabilities.
- Generating structured and high-quality text responses.
- Tasks where alignment with preferred outputs is crucial.