nisiwaki/dpo-qwen-cot-merged_01
The nisiwaki/dpo-qwen-cot-merged_01 model is a 4-billion-parameter variant of Qwen3-4B-Instruct-2507, fine-tuned by nisiwaki using Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality. The model supports a 40,960-token context length and is intended for direct use with the transformers library, with no adapter loading required.
Overview
This model, nisiwaki/dpo-qwen-cot-merged_01, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with its 16-bit weights fully merged into the base model for direct deployment.
Key Capabilities
- Enhanced Reasoning: Specifically optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical progression and structured thinking.
- Improved Response Quality: DPO fine-tuning aligns the model's outputs with preferred responses, leading to higher quality and more aligned generations.
- Direct Use: As a fully merged model, it can be used directly with the transformers library without separate adapter loading.
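Since the DPO weights are already merged, the model loads like any standard transformers checkpoint. A minimal usage sketch follows; the prompt and generation settings are illustrative choices, not the author's published recommendations.

```python
# Minimal inference sketch for the merged model using the standard
# transformers chat workflow. max_new_tokens and the example prompt
# are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisiwaki/dpo-qwen-cot-merged_01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain step by step: what is 17 * 24?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs.shape[-1]:], skip_special_tokens=True
)
print(response)
```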
Training Details
The model was trained with an SFT + DPO pipeline, running 3 epochs per stage with a learning rate of 1e-05 and a DPO beta of 0.1. The maximum sequence length during training was 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot dataset. The model is released under the MIT License; users must also comply with the base model's license terms.
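The DPO stage described above can be sketched with Unsloth and TRL using the stated hyperparameters (learning rate 1e-05, beta 0.1, 3 epochs, max length 1024). The author's exact script is not published, so this is a configuration sketch under assumptions: the LoRA rank, 4-bit loading, and the dataset's "prompt"/"chosen"/"rejected" column layout (TRL's expected DPO format) are all guesses.

```python
# Configuration sketch of the DPO stage; hyperparameters match the card,
# everything else (LoRA rank, quantization, dataset columns) is assumed.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,       # stated training sequence length
    load_in_4bit=True,         # assumption: typical Unsloth memory setting
)
model = FastLanguageModel.get_peft_model(model, r=16)  # LoRA rank assumed

# Assumes TRL's standard preference format: prompt / chosen / rejected.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,              # stated DPO beta
        learning_rate=1e-5,    # stated learning rate
        num_train_epochs=3,    # stated epochs for the DPO stage
        max_length=1024,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights for direct deployment,
# matching the fully merged checkpoint this card describes.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```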