Model Overview
The hifill/dpo-qwen-cot-merged model is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights are fully merged, so it can be used directly without loading adapters.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Structured Response Generation: Aligned to produce higher quality, more structured outputs based on preferred response patterns.
- DPO Fine-tuning: Leverages DPO to align model behavior with human preferences, focusing on specific response characteristics.
Training Details
The model underwent one epoch of DPO training with a learning rate of 1e-07 and a beta of 0.1, using a maximum sequence length of 1024. Training used a LoRA configuration (r=8, alpha=16) whose adapter weights were subsequently merged into the base model. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
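To make the role of beta concrete, the DPO objective can be sketched in plain Python. This is an illustrative per-example form of the standard DPO loss, not code from this model's training run; the function name and arguments are assumptions:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * reward margin).

    beta (0.1 here, matching this model's training) scales how strongly
    the policy is rewarded for preferring the chosen response relative
    to the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, versus the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # Logistic loss on the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is -log(0.5) ≈ 0.693; as the policy's preference for the chosen response grows, the loss falls toward zero. The small beta of 0.1 keeps the fine-tuned model close to the reference policy.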
When to Use This Model
This model is particularly well-suited for applications where improved reasoning, logical coherence, and structured output are critical. It can be beneficial for tasks requiring detailed explanations, step-by-step problem-solving, or generating responses that adhere to specific formats. Note that the model is released under the MIT License, per the dataset terms, and users must also comply with the original base model's license.
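Because the LoRA weights are already merged, the model can be loaded as an ordinary Hugging Face checkpoint with no PEFT or adapter step. A minimal loading sketch (the helper function is hypothetical; only the model id comes from this card):

```python
def load_model(model_id: str = "hifill/dpo-qwen-cot-merged"):
    """Load the merged checkpoint directly -- no adapter loading required."""
    # Deferred import so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # the merged 16-bit weights load as-is
        device_map="auto",    # place layers on available devices
    )
    return tokenizer, model
```

From there, standard chat-style generation with the tokenizer's chat template applies, as with any Qwen3 instruct model.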