Takashi-0000/dpo-qwen-cot-merged0
Takashi-0000/dpo-qwen-cot-merged0 is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It utilizes Direct Preference Optimization (DPO) to enhance reasoning capabilities through Chain-of-Thought (CoT) and improve structured response quality. This model is optimized for generating aligned and coherent outputs, making it suitable for tasks requiring improved logical flow and structured answers.
Loading preview...
Model Overview
This model, Takashi-0000/dpo-qwen-cot-merged0, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.
Key Capabilities
- Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured responses.
- Improved Response Quality: DPO training aligns the model's outputs with preferred examples, enhancing overall response coherence and quality.
- Direct Use: As a full-merged model, it can be used directly with the
transformerslibrary without requiring adapter loading.
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. The training data, u-10bei/dpo-dataset-qwen-cot, focused on preference alignment for reasoning and structured outputs. The model operates under an MIT License, with compliance required for the original base model's license terms.