## Model Overview
This model, reiwa7/dpo-qwen-cot-merged-s250, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, and its LoRA adapters (r=8, alpha=16) have been fully merged into the base model, so it can be used directly without loading adapters separately.
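To illustrate what "fully merged" means, here is a minimal sketch of folding a LoRA adapter into a base weight matrix. The dimensions and random values are illustrative; only the r=8 and alpha=16 hyperparameters come from this model card.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16            # r=8, alpha=16 as in this model's adapters

W = rng.standard_normal((d_out, d_in)).astype(np.float32)        # frozen base weight
A = rng.standard_normal((r, d_in)).astype(np.float32)            # LoRA down-projection
B = (rng.standard_normal((d_out, r)) * 0.01).astype(np.float32)  # LoRA up-projection

scale = alpha / r                                # = 2.0
W_merged = W + scale * (B @ A)                   # adapter folded into the base weight

# The merged matrix reproduces the adapted forward pass exactly:
x = rng.standard_normal(d_in).astype(np.float32)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)), atol=1e-4)
```

Because the low-rank update is absorbed into `W_merged`, inference needs no PEFT runtime and incurs no extra matmul per layer.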
## Key Capabilities & Optimization
The primary objective of this DPO fine-tuning was to align the model's responses with preferred outputs, specifically focusing on:
- Enhanced Reasoning: Significant improvement in Chain-of-Thought (CoT) capabilities, allowing for more logical and step-by-step problem-solving.
- Structured Response Quality: Optimization for generating higher quality and more structured outputs based on preference datasets.
## Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Training Data: the u-10bei/dpo-dataset-qwen-cot dataset
- Configuration: trained for 1 epoch with a learning rate of 5e-05, a DPO beta of 0.067, and a maximum sequence length of 1024 tokens
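For readers unfamiliar with how the beta value above enters training, here is a minimal sketch of the per-pair DPO loss. The log-probability values are placeholders; only beta=0.067 comes from this model card.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.067) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).

    beta scales how strongly the policy is pushed away from the frozen
    reference model toward the preferred (chosen) response.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# When the policy matches the reference, the margin is 0 and the loss is ln(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931...
```

Lowering beta (0.067 here, vs. the common 0.1 default) loosens the KL-style tether to the reference model, allowing larger preference-driven updates.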
## Usage Considerations
As a fully merged model, it can be loaded and used directly with the transformers library. The model is released under the MIT License, inherited from its training data; users must also comply with the original base model's license terms.
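A minimal usage sketch with transformers follows. The prompt is illustrative; the generation parameters are ordinary defaults, not values from this model card.

```python
MODEL_ID = "reiwa7/dpo-qwen-cot-merged-s250"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run one chat turn against the merged checkpoint.

    No PEFT or adapter loading is needed: the DPO LoRA weights are already
    merged into the base model. Imports are kept inside the function so the
    sketch can be read without transformers installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    print(generate("Solve step by step: what is 17 * 24?"))
```

Since the checkpoint keeps the base model's chat template, `apply_chat_template` formats the conversation exactly as during instruction tuning.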