Overview
This model, yuzkawash/dpo-qwen-cot-merged, is a 4 billion parameter language model built on the Qwen3-4B-Instruct-2507 base. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improvements in reasoning and structured response generation. The model ships as fully merged 16-bit weights, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
- Improved Structured Responses: Fine-tuned to produce higher quality, more coherent, and well-structured outputs based on preferred examples.
- Direct Use: As a fully merged model, it can be loaded directly with the transformers library without additional configuration.
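A minimal loading sketch using the standard transformers API is shown below. The generation settings and the prompt are illustrative assumptions, not values taken from this model card; the import is deferred into the function so the sketch can be read and its constants checked without transformers installed.

```python
MODEL_ID = "yuzkawash/dpo-qwen-cot-merged"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a single-turn response with the merged model.

    Because the weights are fully merged, no PEFT/adapter loading
    step is needed; plain AutoModelForCausalLM suffices.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```

For CoT-style tasks, prompts that explicitly ask for step-by-step reasoning (e.g. "Solve step by step: ...") play to the model's fine-tuning objective.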
Training Details
The model was trained for 1.5 epochs with a learning rate of 2e-6 and a DPO beta of 0.2, using a maximum sequence length of 1024. Training used a LoRA configuration (r=8, alpha=16) whose adapters were subsequently merged into the base model. The DPO preference data was sourced from the u-10bei/dpo-dataset-qwen-cot dataset.
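The reported hyperparameters, and the standard DPO objective the beta value plugs into, can be sketched in plain Python. This is the textbook per-example DPO loss on sequence log-probabilities, shown for illustration; it is not the authors' training code.

```python
import math

# Hyperparameters reported in the training details above
DPO_HPARAMS = {
    "epochs": 1.5,
    "learning_rate": 2e-6,
    "beta": 0.2,
    "max_seq_length": 1024,
}
LORA_HPARAMS = {"r": 8, "alpha": 16}

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = DPO_HPARAMS["beta"]) -> float:
    """Per-example DPO loss given sequence log-probabilities.

    beta scales the implicit reward margin between the chosen and
    rejected completions, measured relative to the frozen reference
    model; larger beta keeps the policy closer to the reference.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy matches the reference, the margin is zero, and the loss is log 2 ≈ 0.693; training drives the margin positive for preferred responses, lowering the loss.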
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Complex problem-solving where step-by-step reasoning is crucial.
- Generating structured data or responses that adhere to specific formats.
- Tasks benefiting from improved coherence and logical flow in generated text.