mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024
mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024 is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses, making it suitable for applications that require logical coherence and well-formed outputs.
Model Overview
This model, mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, and the result is a fully merged 16-bit checkpoint that requires no adapter loading.
Key Optimizations
The primary objective of this fine-tuning was to align the model's responses with preferred outputs, with a specific focus on:
- Enhanced Reasoning: Improving Chain-of-Thought (CoT) capabilities.
- Structured Response Quality: Generating more coherent and well-formed outputs based on a preference dataset.
Training Configuration
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO (Direct Preference Optimization)
- Epochs: 1
- Learning Rate: 1e-07
- Max Sequence Length: 2048
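To make the training objective above concrete, here is a minimal sketch of the per-pair DPO loss. The card does not state the beta (KL-penalty) coefficient, so the default of 0.1 used here is an assumption; the log-probabilities are summed token log-probs of each full response under the trained policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a complete response
    under the policy or the frozen reference model. beta is assumed to
    be 0.1 here (not stated in the card); it controls how far the
    policy may drift from the reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); log1p keeps it stable
    return math.log1p(math.exp(-margin))
```

With a zero margin the loss is log 2; as the policy prefers the chosen response more strongly than the reference does, the loss falls toward zero, which is what the single low-learning-rate epoch above is nudging the model toward.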
Intended Use Cases
This model is particularly well-suited for applications where:
- Logical Reasoning is critical, benefiting from its CoT optimization.
- High-Quality, Structured Outputs are required, such as in question-answering, summarization, or content generation tasks demanding clear organization.
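Because the checkpoint is fully merged, it can be prompted like any Qwen chat model. The sketch below shows the ChatML message layout that Qwen-family models use; in practice you would let the tokenizer's `apply_chat_template` produce this string, so this is only an illustration of the wire format.

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts into the ChatML
    layout used by Qwen chat models. Illustrative only: prefer
    tokenizer.apply_chat_template in real code."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "user",
     "content": "Think step by step: which is larger, 7/9 or 11/14?"},
])
```

Phrasing requests with an explicit "think step by step" cue, as above, plays to the CoT-focused preference tuning described earlier.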
Licensing
The model is distributed under the MIT License, matching the license of its preference training data. Users must also comply with the license terms of the base model, Qwen/Qwen3-4B-Instruct-2507.