TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model
TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, optimized using Direct Preference Optimization (DPO) via Unsloth. This 4-billion-parameter model focuses on improving reasoning capabilities through Chain-of-Thought (CoT) prompting and on enhancing structured response quality. It is designed for applications that require aligned, high-quality outputs on reasoning tasks.
Model Overview
This model, TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model, is a specialized variant of the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library, to align its responses with preferred outputs.
Key Capabilities & Optimization
- Enhanced Reasoning: The primary objective of this DPO fine-tuning was to improve the model's reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
- Structured Response Quality: It is optimized to produce higher quality and more structured responses based on a preference dataset.
- Full-Merged Weights: The repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
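Because the 16-bit merged weights are published directly, the model can be loaded like any standard checkpoint, with no PEFT adapter step. A minimal sketch using Hugging Face transformers (the dtype and device settings are illustrative assumptions, not stated on this card):

```python
# Minimal loading sketch for the full-merged 16-bit weights.
# No adapter/PEFT loading is required; this is a plain checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model"

def load_model():
    """Load the tokenizer and the merged model weights."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",  # assumption: the card only says "16-bit"
        device_map="auto",       # assumption: accelerate is installed
    )
    return tokenizer, model
```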
Training Details
- Methodology: DPO was applied over 1 epoch with a learning rate of 1e-07 and a beta value of 0.1.
- Context Length: The training utilized a maximum sequence length of 1024 tokens.
- LoRA Configuration: LoRA (r=8, alpha=16) was used during training and subsequently merged into the base model.
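For illustration, the stated hyperparameters map naturally onto TRL's DPOConfig and PEFT's LoraConfig, which Unsloth builds on. Only the numeric values come from this card; the field names, output directory, and everything else in this sketch are assumptions:

```python
# Hypothetical reconstruction of the training configuration.
# Values (beta, learning rate, epochs, max length, LoRA r/alpha)
# are from this card; all other settings are illustrative.
from trl import DPOConfig
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    beta=0.1,                 # DPO temperature
    learning_rate=1e-7,
    num_train_epochs=1,
    max_length=1024,          # maximum sequence length used in training
    output_dir="dpo-qwen-cot",  # assumption: directory name is illustrative
)
```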
Intended Use Cases
This model is particularly well-suited for applications where:
- Improved Reasoning is critical, especially for tasks benefiting from Chain-of-Thought prompting.
- High-Quality, Aligned Outputs are required, reflecting preferred response styles and structures.
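For tasks that benefit from Chain-of-Thought prompting, a chat-style prompt can make the step-by-step reasoning explicit. A hypothetical example (the system instruction and question are illustrative; `tokenizer` and `model` are assumed to be loaded via transformers' `AutoTokenizer`/`AutoModelForCausalLM`):

```python
# Illustrative CoT-style prompt; the wording is an assumption, not
# a prompt format prescribed by this card.
messages = [
    {"role": "system",
     "content": "Reason step by step before giving a final answer."},
    {"role": "user",
     "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]

# With a loaded tokenizer/model, generation would look like:
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt"
# ).to(model.device)
# output = model.generate(inputs, max_new_tokens=512)
# print(tokenizer.decode(output[0], skip_special_tokens=True))
```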
Licensing
The model's training data is sourced from u-10bei/dpo-dataset-qwen-cot under the MIT License. Users must also comply with the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507.