KhaledScience/dpo-qwen-cot-merged
KhaledScience/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based instruction-tuned causal language model developed by KhaledScience. Fine-tuned using Direct Preference Optimization (DPO) with a focus on Chain-of-Thought (CoT) reasoning, it excels at generating structured and aligned responses. This model is optimized for improving reasoning capabilities and overall response quality.
Loading preview...
Model Overview
KhaledScience/dpo-qwen-cot-merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO) via the Unsloth library to enhance its response quality and alignment.
Key Capabilities
- Improved Reasoning: Specifically optimized to enhance Chain-of-Thought (CoT) reasoning, leading to more structured and logical outputs.
- Aligned Responses: DPO training aligns the model's outputs with preferred examples, improving overall response quality.
- Direct Use: Provided as a full-merged 16-bit model, requiring no adapter loading for direct integration with
transformers.
Training Details
- Methodology: Utilizes DPO with a beta of 0.1 and a learning rate of 1e-07 over 1 epoch.
- Dataset: Trained on the u-10bei/dpo-dataset-qwen-cot preference dataset.
- Context Length: Supports a maximum sequence length of 1024 tokens during training.
Licensing
This model is released under the MIT License, consistent with the terms of its training data. Users must also adhere to the original base model's license terms.