Model Overview
Tamata1208/dpo-qwen-cot-merged is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base model. It was further fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): The model is specifically optimized to improve its ability to generate detailed, step-by-step reasoning processes, making it suitable for complex problem-solving.
- Improved Structured Responses: Through DPO training, the model aligns its outputs with preferred formats, leading to higher quality and more consistent structured responses.
- Direct Use: This repository provides the fully merged 16-bit weights, so no adapter loading is required for deployment, which simplifies integration into existing workflows.
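Because the weights are fully merged, the model can be loaded like any standard causal LM. The sketch below assumes the `transformers` library and `torch` are installed and that enough GPU or CPU memory is available for a 4B model; it downloads the weights from the Hub on first use, so it is illustrative rather than something to run offline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tamata1208/dpo-qwen-cot-merged"

# Load tokenizer and merged 16-bit weights directly; no PEFT adapter step needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat-formatted prompt and generate a step-by-step answer.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The prompt string here is just an example; any chat-style input works through `apply_chat_template`.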
Training Details
The model underwent 2 epochs of DPO training with a learning rate of 5e-06 and a beta value of 0.1, using a maximum sequence length of 2048 tokens. Training used LoRA adapters (r=8, alpha=16), which were subsequently merged into the base model. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
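The hyperparameters above can be expressed as a training configuration. This is a rough sketch using TRL's `DPOTrainer` rather than the Unsloth pipeline actually used, so treat it as an approximation of the setup, not a reproduction script; it assumes `trl`, `peft`, and `datasets` are installed, and the output directory name is arbitrary.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Preference dataset named in the card (chosen/rejected pairs).
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

# LoRA settings from the card: r=8, alpha=16.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

# DPO hyperparameters from the card.
training_args = DPOConfig(
    output_dir="dpo-qwen-cot",   # arbitrary output path
    beta=0.1,
    learning_rate=5e-6,
    num_train_epochs=2,
    max_length=2048,
)

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",  # base model
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()

# After training, LoRA weights can be merged into the base model
# (e.g. via PEFT's merge_and_unload) and saved as full 16-bit weights.
```

The merge step at the end is what makes the published repository adapter-free.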
Licensing
This model is released under the MIT License, consistent with the terms of its training dataset. Users must also comply with the license terms of the original Qwen base model.