Model Overview
This model, seibergwitten/dpo-qwen-cot-merged.ver0, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to improve response alignment.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more structured and logical outputs.
- Aligned Responses: DPO fine-tuning steers generated text toward preferred outputs, yielding higher-quality and more consistent responses.
- Structured Output: Trained on a preference dataset that rewards well-structured responses.
- Direct Usage: Provided as fully merged 16-bit weights, so no adapter loading is needed; the model can be used directly with the transformers library.
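A minimal inference sketch with transformers. The prompt and generation settings below are illustrative assumptions, not tuned recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "seibergwitten/dpo-qwen-cot-merged.ver0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",  # merged 16-bit weights; no adapter loading required
    device_map="auto",
)

# Example prompt; any chat-formatted input works
messages = [{"role": "user", "content": "Explain step by step why the sky is blue."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```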
Training Details
The model underwent one epoch of DPO training with a learning rate of 1e-7 and a beta of 0.1. It used a maximum sequence length of 1024 tokens and a LoRA configuration (r=8, alpha=16), which has since been merged into the base weights.
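For reference, the DPO objective behind this training can be sketched in plain Python. This is a minimal illustration of the per-pair loss, not the actual training code; beta=0.1 matches the setting above:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin difference)."""
    # Implicit rewards are log-prob margins relative to the reference model
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Negative log-sigmoid; lower when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference (both margins zero), the loss is log 2; it falls below that as the policy learns to rank the chosen response above the rejected one.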
Good For
- Applications requiring models with strong reasoning and logical flow.
- Tasks where response quality and alignment to specific preferences are critical.
- Generating structured and coherent text outputs.