reiwa7/dpo-qwen-cot-merged-s250
The reiwa7/dpo-qwen-cot-merged-s250 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized to enhance reasoning capabilities, particularly Chain-of-Thought (CoT), and improve the quality of structured responses. It features a 40960-token context length and is designed for applications requiring improved logical coherence and structured output.
Loading preview...
Model Overview
This model, reiwa7/dpo-qwen-cot-merged-s250, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with its LoRA adapters (r=8, alpha=16) fully merged into the base model for direct use without additional loading.
Key Capabilities & Optimization
The primary objective of this DPO fine-tuning was to align the model's responses with preferred outputs, specifically focusing on:
- Enhanced Reasoning: Significant improvement in Chain-of-Thought (CoT) capabilities, allowing for more logical and step-by-step problem-solving.
- Structured Response Quality: Optimization for generating higher quality and more structured outputs based on preference datasets.
Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Training Data: Utilized the
u-10bei/dpo-dataset-qwen-cotdataset. - Configuration: Trained for 1 epoch with a learning rate of 5e-05 and a beta value of 0.067. The maximum sequence length during training was 1024 tokens.
Usage Considerations
As a fully merged model, it can be directly integrated and used with the transformers library. The model operates under an MIT License, inherited from its training data, and users must also comply with the original base model's license terms.