OguraHiroyuki/dpo-qwen-cot-mergedv4
OguraHiroyuki/dpo-qwen-cot-mergedv4 is a fine-tuned Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. This 4 billion parameter model focuses on improving reasoning through Chain-of-Thought (CoT) and enhancing structured response quality. It is designed for applications requiring aligned and coherent text generation, particularly in conversational AI and instruction following.
Loading preview...
Model Overview
OguraHiroyuki/dpo-qwen-cot-mergedv4 is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its outputs with preferred responses.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning abilities.
- Structured Response Quality: Focuses on generating higher quality, more structured outputs based on preference datasets.
- Instruction Following: Designed for better adherence to instructions, making it suitable for conversational and task-oriented AI.
Training Details
The model was trained for 1 epoch with a learning rate of 1e-06 and a beta value of 0.1, using a maximum sequence length of 1024. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset. The LoRA configuration (r=8, alpha=16) was merged into the base model, providing full 16-bit weights without requiring adapter loading.
Usage
This merged model can be directly used with the transformers library, simplifying deployment for inference tasks. It is licensed under the MIT License, with users also required to comply with the original base model's license terms.