matsunya/dpo_qwen_cot_merged
The matsunya/dpo_qwen_cot_merged model is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized to enhance reasoning capabilities through Chain-of-Thought (CoT) and improve structured response quality. It is designed for applications requiring high-quality, aligned outputs in reasoning tasks.
Loading preview...
Model Overview
matsunya/dpo_qwen_cot_merged is a 4 billion parameter language model, derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.
Key Optimizations
- Enhanced Reasoning: The primary objective of this DPO fine-tuning was to improve the model's reasoning abilities, particularly in generating Chain-of-Thought (CoT) responses.
- Structured Output Quality: Optimization also focused on aligning the model's outputs with preferred responses to enhance the overall quality and structure of its generated text.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 3e-05 and a beta value of 0.1. It utilized a maximum sequence length of 2048 tokens. The training data for preference optimization was sourced from the u-10bei/dpo-dataset-qwen-cot dataset.
Usage Considerations
As a fully merged model, it can be directly integrated and used with the transformers library. Users must adhere to the MIT License of the training data and the original base model's license terms.