KeigoK/dpo-qwen-cot-merged
KeigoK/dpo-qwen-cot-merged is a 4 billion parameter language model fine-tuned from Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to enhance reasoning capabilities, particularly Chain-of-Thought (CoT), and improve the quality of structured responses. It leverages a preference dataset for alignment, making it suitable for applications requiring precise and well-reasoned outputs.
Loading preview...
Model Overview
KeigoK/dpo-qwen-cot-merged is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, integrating the full-merged 16-bit weights without requiring adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning abilities.
- Structured Response Quality: Aligned with preferred outputs to generate higher quality and more structured responses.
- Direct Use: As a merged model, it can be used directly with the
transformerslibrary for inference.
Training Details
The model underwent a single epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 1024 tokens. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset.
Licensing
This model operates under the MIT License, as per the terms of its training dataset. Users are also required to comply with the original base model's license terms.