poko75/dpo-qwen-cot-merged
poko75/dpo-qwen-cot-merged is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model, fine-tuned with Direct Preference Optimization (DPO) to enhance Chain-of-Thought reasoning and structured response quality. Derived from Qwen/Qwen3-4B-Instruct-2507, it is optimized for generating aligned, coherent outputs that reflect preference data. It offers a 40,960-token context length and suits tasks requiring improved logical flow and structured answers.
Model Overview
poko75/dpo-qwen-cot-merged is a 4-billion-parameter language model based on the Qwen3 architecture, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, with a focus on improving reasoning capabilities and structured response quality.
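For background, DPO fine-tunes the policy directly on preference pairs, without a separate reward model. A sketch of the standard DPO objective (notation from the original DPO formulation, not from this model card): given a prompt $x$, a preferred response $y_w$, and a rejected response $y_l$,

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ is the frozen base model (here Qwen/Qwen3-4B-Instruct-2507) and $\beta$ controls how far the policy may drift from it.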
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, leading to more logical and step-by-step problem-solving.
- Improved Structured Responses: Fine-tuned to generate higher quality, more coherent, and structured outputs based on preference data.
- Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
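Because the weights are fully merged, the model can be loaded like any standard causal LM, with no PEFT or adapter step. A minimal inference sketch using the Hugging Face `transformers` API (the prompt and generation settings below are illustrative, not prescribed by the card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "poko75/dpo-qwen-cot-merged"  # merged 16-bit weights: no adapter loading needed


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a response using the model's chat template."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    # Build input ids with the generation prompt appended
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Explain step by step why 97 is prime."))
```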
Training Details
The model was trained for 1 epoch with a learning rate of 1e-7, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset. The base model's license terms and the dataset's MIT License apply.
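The stated hyperparameters can be collected as follows. This is a hypothetical sketch for reproducing a similar run; the actual training used Unsloth, whose exact API and defaults may differ:

```python
# Hyperparameters as stated on this card; field names are illustrative
# and would map onto whatever trainer config (e.g. TRL's DPOConfig) is used.
dpo_hyperparams = {
    "num_train_epochs": 1,
    "learning_rate": 1e-7,       # very low LR, typical for DPO fine-tuning
    "beta": 0.1,                 # KL-regularization strength of the DPO loss
    "max_seq_length": 1024,      # maximum sequence length during training
    "dataset": "u-10bei/dpo-dataset-qwen-cot",
    "base_model": "Qwen/Qwen3-4B-Instruct-2507",
}
```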
Good For
- Applications requiring strong reasoning and logical deduction.
- Tasks where structured and high-quality responses are critical.
- Developers looking for a Qwen3-based model with enhanced alignment and CoT capabilities.