taka104/qwen3-4b-dpo-qwen-cot-merged
The taka104/qwen3-4b-dpo-qwen-cot-merged model is a 4 billion parameter instruction-tuned language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) to enhance its reasoning capabilities, specifically Chain-of-Thought (CoT), and improve the quality of structured responses. This model is optimized for tasks requiring logical deduction and well-structured output, making it suitable for applications where coherent and reasoned answers are crucial.
Loading preview...
taka104/qwen3-4b-dpo-qwen-cot-merged Overview
This model is a fine-tuned variant of the Qwen/Qwen3-4B-Instruct-2507 base model, developed by taka104. It leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on improving reasoning and structured response quality.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
- Improved Response Quality: DPO fine-tuning aims to produce higher quality and more aligned outputs based on preference datasets.
- Direct Use: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with
transformers.
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. The training data utilized was u-10bei/dpo-dataset-qwen-cot. The model is released under the MIT License, with compliance required for the original base model's license terms.