Model Overview
This model, keijiban3/dpo-qwen-cot-merged, is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, with approximately 4 billion parameters and a context length of 32,768 tokens. It was optimized using Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights have been fully merged, so it can be used directly without loading adapters.
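Because the adapters are already merged, the model loads like any standard Transformers checkpoint. The following is a minimal inference sketch using the Hugging Face `transformers` library; the prompt and generation parameters are illustrative, not part of the released configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "keijiban3/dpo-qwen-cot-merged"

# Merged 16-bit weights: no PEFT/adapter loading step is required.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merged 16-bit weights
    device_map="auto",           # requires the `accelerate` package
)

messages = [
    {"role": "user", "content": "If a train covers 180 km in 2 hours and then "
                                "120 km in 1 hour, what is its average speed? "
                                "Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```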
Key Capabilities
- Enhanced Reasoning: Specifically trained to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring multi-step logical deduction.
- Structured Response Quality: Optimized to align responses with preferred outputs, leading to more coherent and structured generations.
- DPO Fine-tuning: Leverages DPO to refine model behavior based on preference datasets, aiming for higher-quality, better-aligned outputs (see the preference-pair example after this list).
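DPO learns from pairs of responses where one is preferred over the other. Below is a hypothetical preference record in the prompt/chosen/rejected format used by TRL's DPOTrainer; the field names follow the TRL convention, and the contents are invented for illustration since the actual training data is not published here.

```python
# One preference record: the "chosen" response shows step-by-step reasoning,
# while the "rejected" response is terse and incorrect.
preference_example = {
    "prompt": "A shop sells pens at 3 for $2. How much do 12 pens cost? "
              "Think step by step.",
    "chosen": (
        "12 pens is 12 / 3 = 4 groups of three. "
        "Each group costs $2, so 4 * 2 = $8. The answer is $8."
    ),
    "rejected": "12 pens cost $6.",
}
```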
Training Details
The model was trained for 1 epoch of DPO with a learning rate of 1e-7 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens. Training used LoRA adapters (r=8, alpha=16), which were then merged into the base model.
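A minimal sketch of how such a run could be set up with Unsloth and TRL, assuming a preference dataset in the prompt/chosen/rejected format shown above. Only the hyperparameters named in this card (1 epoch, lr 1e-7, beta 0.1, max length 1024, r=8, alpha=16) come from the card; the dataset file, target modules, and output paths are assumptions, and exact trainer arguments vary by TRL version.

```python
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

# Load the base model with Unsloth; max_seq_length matches the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)

# Attach LoRA adapters with the configuration stated above (r=8, alpha=16).
# This target_modules list is a common default, not confirmed by this card.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical preference dataset with prompt/chosen/rejected columns.
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,            # DPO beta from the card
        learning_rate=1e-7,  # learning rate from the card
        num_train_epochs=1,  # 1 epoch from the card
        max_length=1024,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA weights into the base model and save full 16-bit weights
# (Unsloth helper; this produces a checkpoint usable without adapter loading).
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```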
Good For
- Applications requiring improved logical reasoning and step-by-step explanations.
- Generating structured outputs that adhere to specific formats or preferences.
- Tasks where response quality and alignment with human preferences are critical.