Model Overview
nyannto/dpo-qwen-cot-merged11 is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, and the resulting LoRA weights were fully merged into the 16-bit base weights, so the model can be loaded directly without adapters.
Key Capabilities & Optimization
The DPO training targeted two areas of output quality:
- Improved Reasoning (Chain-of-Thought): The DPO training incorporated a preference dataset designed to refine the model's step-by-step reasoning processes.
- Structured Response Quality: The preference pairs reward more coherent, well-organized responses over less structured alternatives.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 2e-05 and a beta value of 0.2, using a maximum sequence length of 1024. Training used LoRA adapters (r=8, alpha=16) that were merged into the base model afterward. The preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
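To make the beta value concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. The function and its inputs are illustrative (not the actual training code, which used Unsloth): it takes summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, and beta=0.2 scales the log-ratio margin.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.2):
    """DPO loss for one preference pair.

    Arguments are summed log-probabilities of the chosen/rejected
    responses under the policy (pi_*) and the reference model (ref_*).
    """
    # How much more (in log space) each model favors each response
    # relative to the reference.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    # DPO: -log sigmoid(beta * margin). A larger beta penalizes
    # deviation from the reference more sharply.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With identical policy and reference log-probs the margin is zero,
# giving the starting loss of log(2) ~= 0.693.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0, beta=0.2)

# When the policy favors the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2).
improved = dpo_loss(pi_chosen=-12.0, pi_rejected=-20.0,
                    ref_chosen=-14.0, ref_rejected=-18.0, beta=0.2)
```

The numbers here are hypothetical log-probabilities chosen only to show the direction of the gradient signal; training drives the margin up, pushing the loss toward zero.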
Usage & Licensing
Because the weights are fully merged, the model loads directly with the transformers library and requires no PEFT adapters. It is released under the MIT License, consistent with its training dataset, and users must also adhere to the original base model's license terms.
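A minimal loading sketch with the standard transformers API, assuming the model ID is available on the Hugging Face Hub and a suitable accelerator is present (the prompt text is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nyannto/dpo-qwen-cot-merged11"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # merged 16-bit weights load directly; no adapters
    device_map="auto",
)

# Qwen instruct models use a chat template; build the prompt through it.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the DPO training emphasized chain-of-thought, prompts that ask for step-by-step reasoning play to the model's strengths.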