nyannto/dpo-qwen-cot-merged11
nyannto/dpo-qwen-cot-merged11 is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model fine-tuned by nyannto using Direct Preference Optimization (DPO). It is optimized to improve Chain-of-Thought (CoT) reasoning and structured response quality, supports a 32,768-token context length, and is intended for tasks requiring aligned, high-quality outputs.
Model Overview
nyannto/dpo-qwen-cot-merged11 is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and the resulting weights were fully merged at 16-bit precision, so the model can be used directly without loading adapters.
Key Capabilities & Optimization
The DPO training optimized the model to generate preferred outputs, focusing on:
- Improved Reasoning (Chain-of-Thought): The DPO training incorporated a preference dataset designed to refine the model's step-by-step reasoning processes.
- Structured Response Quality: It aims to produce more coherent and well-organized outputs based on user preferences.
Training Details
The model was trained for 1 epoch of DPO with a learning rate of 2e-5 and a beta value of 0.2, using a maximum sequence length of 1024 tokens. LoRA adapters (r=8, alpha=16) were trained and then merged into the base model. The training data came from the u-10bei/dpo-dataset-qwen-cot dataset.
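To make the beta=0.2 setting concrete, the sketch below shows the standard per-example DPO loss in plain Python: the negative log-sigmoid of beta times the difference between the policy's and the reference model's log-probability margins on the chosen versus rejected responses. The log-probability values are illustrative placeholders, not numbers from this model's training run.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.2) -> float:
    """Per-example DPO loss; beta=0.2 matches the value reported for this model."""
    # Log-ratio of policy vs. reference on each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) written stably as log1p(exp(-x)).
    return math.log1p(math.exp(-logits))

# Illustrative example: the policy favors the chosen response more strongly
# than the reference does, so the loss falls below log(2) (~0.693).
loss = dpo_loss(-10.0, -14.0, ref_chosen_logp=-12.0, ref_rejected_logp=-13.0)
```

Minimizing this loss pushes the policy to widen its preference margin for chosen over rejected responses relative to the frozen reference model, with beta controlling how sharply deviations are rewarded.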
Usage & Licensing
As a merged model, it can be loaded directly with the transformers library. The model is released under the MIT License, consistent with its training dataset, and users must also comply with the original base model's license terms.
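A minimal loading sketch with transformers is shown below. The model id comes from this card; the prompt, sampling settings, and the choice to defer loading into a helper function are illustrative assumptions, not part of the card.

```python
# Hypothetical usage sketch for the merged model; adjust dtype/device for your hardware.
MODEL_ID = "nyannto/dpo-qwen-cot-merged11"

# A chat-style prompt suited to the model's CoT tuning.
messages = [
    {"role": "user",
     "content": "Explain step by step why the sum of two odd numbers is even."},
]

def generate(max_new_tokens: int = 512) -> str:
    # Imports kept local so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Apply the Qwen3 chat template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate())
```

Because the 16-bit weights are already merged, no PEFT/adapter loading step is needed; the model behaves like any standard causal LM checkpoint.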