Model Overview
This model, sfutenma/dpo-qwen3_4b-cot-merged_v260227-161515, is a 4-billion-parameter Qwen3-based language model developed by sfutenma. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, building on the base model sfutenma/lora_structeval_t_qwen3_4b_v260221-161528. Fine-tuning focused on aligning the model's responses with preferred outputs, specifically targeting improvements in Chain-of-Thought (CoT) reasoning and the quality of structured responses.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) prompting to improve logical deduction and problem-solving.
- Structured Response Generation: Designed to produce high-quality, well-formatted structured outputs based on preference data.
- DPO Fine-tuning: Leverages Direct Preference Optimization for better alignment with desired response characteristics.
- Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
Training Details
The model was trained for 5 epochs with a learning rate of 2e-07, a DPO beta of 0.03, and a maximum sequence length of 768 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset. The base model's LoRA adapter (r=8, alpha=16) was merged into the final full-precision weights during the process.
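The hyperparameters above can be expressed as a TRL-style DPO configuration. This is a hedged sketch only: the actual run used Unsloth (whose trainer accepts a compatible config), and the output directory name is an assumption, not a value from the original training script.

```python
# Hedged sketch: the training hyperparameters above as a TRL DPOConfig.
# Assumptions: the use of TRL's DPOConfig and the output directory name;
# the actual Unsloth training script may differ in structure.
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-qwen3_4b-cot",  # assumed name, for illustration
    num_train_epochs=5,
    learning_rate=2e-7,
    beta=0.03,       # DPO preference-strength temperature
    max_length=768,  # maximum sequence length (prompt + completion)
)
```

A trainer such as `trl.DPOTrainer` (or Unsloth's wrapper around it) would take this config together with the base model and the u-10bei/dpo-dataset-qwen-cot preference pairs.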
Ideal Use Cases
This model is particularly well-suited to applications where precise reasoning, logical coherence, and structured output are critical. Because the weights are fully merged, developers can load it directly with the transformers library (no adapter loading step is required) for conversational tasks that demand structured, well-reasoned responses.
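A minimal integration sketch follows. The system-prompt wording, dtype, and generation settings are illustrative assumptions rather than values documented for this model; only the model ID comes from this card.

```python
# Minimal sketch of using the merged model via transformers.
# The "step by step" system instruction is an illustrative assumption,
# not a documented requirement of this model.

MODEL_ID = "sfutenma/dpo-qwen3_4b-cot-merged_v260227-161515"

def build_cot_messages(question: str) -> list[dict]:
    """Wrap a question in a chat-format message list with a CoT nudge."""
    return [
        {"role": "system", "content": "Reason step by step before answering."},
        {"role": "user", "content": question},
    ]

def generate(question: str, max_new_tokens: int = 512) -> str:
    """Load the merged 16-bit weights (no adapter needed) and generate."""
    # Heavy third-party imports are local so the prompt helper stays importable.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_cot_messages(question),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate("If a train leaves at 3pm traveling 60 mph, how far does it travel by 5pm?")` would return a step-by-step answer; adjust `max_new_tokens` upward for longer reasoning chains.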