Model Overview
KSIMNB/dpo-qwen-cot-merged is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to align its outputs with preferred responses. The repository contains fully merged 16-bit weights, so no separate adapter loading is required.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized through DPO to improve the model's ability to generate logical and step-by-step reasoning processes.
- Improved Structured Responses: Produces higher-quality, better-aligned structured outputs, guided by the preference data used during training.
- Direct Use: As a fully merged model, it can be loaded directly with the transformers library; no additional configuration for LoRA adapters is needed.
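Because the 16-bit weights are already merged, a standard `transformers` load is enough. The sketch below assumes the usual Auto-class API and the model's built-in chat template; the prompt and generation settings are illustrative, not values from this card.

```python
# Minimal usage sketch: assumes the standard transformers Auto-class API and
# the chat template shipped with the model. Generation settings are
# illustrative choices, not values from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KSIMNB/dpo-qwen-cot-merged"

def build_chat(prompt: str) -> list[dict]:
    # Wrap a user prompt in the message format consumed by apply_chat_template.
    return [{"role": "user", "content": prompt}]

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Fully merged weights: no PeftModel / LoRA-adapter attach step is needed.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        build_chat(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated continuation, not the prompt tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Step-by-step reasoning tends to be long, so leave `max_new_tokens` generous enough for the full chain of thought.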
Training Details
The model was trained for a single epoch of DPO with a learning rate of 1e-7 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens. The preference dataset is u-10bei/dpo-dataset-qwen-cot.
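For reference, the DPO objective behind these hyperparameters scores each preference pair by the policy-vs-reference log-probability margin of the chosen response over the rejected one, scaled by beta. The pure-Python sketch below shows the per-pair loss; the function name and scalar interface are illustrative (real trainers such as Unsloth operate on batched token log-probs), but beta = 0.1 matches this training run.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # the beta value used for this model's DPO run
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under the
    trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x), written in a numerically stable form for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference model exactly, the margin is zero and the loss equals log 2; widening the chosen-over-rejected margin drives the loss toward zero. The low learning rate of 1e-7 keeps each gradient step on the policy small, with the reference model held frozen throughout.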
When to Use This Model
This model suits use cases where reasoning quality and alignment to a preferred response style are critical. It is a good fit for applications requiring coherent, structured, and logically sound text generation, especially tasks that benefit from Chain-of-Thought reasoning.