## Model Overview
This model, `dpo-qwen-cot-merged`, is a 4-billion-parameter variant of the Qwen3 architecture, fine-tuned from `Qwen/Qwen3-4B-Instruct-2507`. It was aligned with Direct Preference Optimization (DPO) via the Unsloth library to steer its responses toward preferred outputs.
## Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Fine-tuned to produce clearer step-by-step reasoning traces.
- Improved Structured Responses: Focuses on producing higher quality and more coherent structured outputs.
- DPO Fine-tuning: Utilizes DPO with a preference dataset to guide response generation towards desired characteristics.
- Merged Weights: Contains full 16-bit merged weights, eliminating the need for adapter loading and simplifying deployment with `transformers`.
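Because the LoRA weights are already merged, the model can be loaded directly with `transformers` without any PEFT adapter step. A minimal sketch follows; the repository id `u-10bei/dpo-qwen-cot-merged` is an assumption inferred from the model and dataset names, so adjust it if the actual repo differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id is assumed from the model name; change if it differs.
model_id = "u-10bei/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",  # weights are stored as 16-bit merged weights
    device_map="auto",
)

# Chat-style prompting via the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No adapter merge or `peft` dependency is needed at inference time, since the LoRA weights were folded into the base model before release.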
## Training Details
The model was trained for 1 epoch of DPO with a learning rate of 1e-7, a beta of 0.1, and a maximum sequence length of 1024 tokens. A LoRA adapter (r=8, alpha=16) was trained and then merged into the base model. The DPO preference data was sourced from [u-10bei/dpo-dataset-qwen-cot].
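For reference, the DPO objective behind this training can be sketched in plain Python. The beta of 0.1 matches the value above; the log-probability inputs in the usage lines are illustrative only, not values from the actual run.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward margins: how far the policy has shifted probability
    # mass toward each response relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Loss = -log(sigmoid(beta * (chosen_margin - rejected_margin)))
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization the policy equals the reference, so both margins are
# zero and the loss is log(2). As the policy learns to prefer the chosen
# response, the loss decreases.
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
loss_aligned = dpo_loss(-8.0, -14.0, -10.0, -12.0)  # policy prefers chosen
```

The small beta (0.1) keeps the implicit KL penalty gentle, so the fine-tuned policy stays close to the reference model while still learning the preference ordering.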
## Licensing
This model is released under the MIT License, consistent with the terms of its training dataset. Users must also comply with the license terms of the original base model, `Qwen3-4B-Instruct-2507`.